| Title: | Datasets and Basic Statistics for Symbolic Data Analysis |
|---|---|
| Description: | Collects a diverse range of symbolic data and offers a comprehensive set of functions that facilitate the conversion of traditional data into the symbolic data format. |
| Authors: | Po-Wei Chen [aut], Chun-houh Chen [aut], Han-Ming Wu [cre] |
| Maintainer: | Han-Ming Wu <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.2.5 |
| Built: | 2026-05-21 10:52:49 UTC |
| Source: | https://github.com/hanmingwu1103/datasda |
Interval-valued dataset of 24 units from the UCI Abalone dataset,
aggregated by sex and age group. iGAP format (comma-separated interval
strings). See abalone.int for the Min-Max column format.
data(abalone.iGAP)
A data frame with 24 observations (e.g., F-10-12, M-4-6) and
7 character columns in iGAP format (comma-separated "min, max" strings):
Length: Shell length range.
Diameter: Shell diameter range.
Height: Shell height range.
Whole: Whole weight range.
Shucked: Shucked weight range.
Viscera: Viscera weight range.
Shell: Shell weight range.
Row names encode Sex-AgeGroup (e.g., F-10-12 = Female age 10–12).
| Sample size (n) | 24 |
| Variables (p) | 7 |
| Subject area | Marine biology |
| Symbolic format | Interval (iGAP) |
| Analytical tasks | Clustering, Visualization |
UCI Machine Learning Repository.
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
data(abalone.iGAP)
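The relationship between the iGAP layout and the Min-Max layout can be sketched in base R. This is only an illustration under stated assumptions: the values are made up (not taken from abalone.iGAP), and split_interval is a hypothetical helper, not a function of this package.

```r
# Illustrative iGAP-style data; values are invented, not from abalone.iGAP.
igap <- data.frame(
  Length   = c("0.28, 0.66", "0.13, 0.42"),
  Diameter = c("0.20, 0.48", "0.09, 0.35"),
  row.names = c("F-10-12", "M-4-6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each "min, max" string on the comma and
# return an n x 2 numeric matrix (column 1 = min, column 2 = max).
split_interval <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  t(vapply(parts, function(p) as.numeric(trimws(p)), numeric(2)))
}

# Build paired <name>_min / <name>_max columns, one pair per variable.
mm <- do.call(cbind, lapply(names(igap), function(v) {
  m <- split_interval(igap[[v]])
  colnames(m) <- paste0(v, c("_min", "_max"))
  m
}))
mm <- as.data.frame(mm, row.names = rownames(igap))
mm
```

The result has two numeric columns per interval variable, which is the shape described for abalone.int.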
Interval-valued dataset of 24 units from the UCI Abalone dataset,
aggregated by sex and age group. Min-Max column format (two columns per
variable). See abalone.iGAP for the iGAP format version.
data(abalone.int)
A data frame with 24 observations and 14 columns (7 interval variables
in _min/_max pairs):
Length_min, Length_max: Shell length range.
Diameter_min, Diameter_max: Shell diameter range.
Height_min, Height_max: Shell height range.
Whole_min, Whole_max: Whole weight range.
Shucked_min, Shucked_max: Shucked weight range.
Viscera_min, Viscera_max: Viscera weight range.
Shell_min, Shell_max: Shell weight range.
Row names encode Sex-AgeGroup (e.g., F-10-12 = Female age 10–12).
| Sample size (n) | 24 |
| Variables (p) | 14 |
| Subject area | Marine biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Visualization |
UCI Machine Learning Repository.
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
data(abalone.int)
Interval-valued acid rain pollution indices for sulphates and nitrates (kg/hectare) for 2 US states (Massachusetts and New York).
data(acid_rain.int)
A data frame with 2 observations and 5 variables in Min-Max format:
state: State name (character).
sulphate_l, sulphate_u: Sulphate pollution index range (kg/hectare).
nitrate_l, nitrate_u: Nitrate pollution index range (kg/hectare).
| Sample size (n) | 2 |
| Variables (p) | 5 |
| Subject area | Environment |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.21.
data(acid_rain.int)
Interval-valued dataset of 7 age-group observations with cholesterol and weight measurements. Each observation aggregates individuals in a 10-year age band with interval ranges for cholesterol and weight.
data(age_cholesterol_weight.int)
A symbolic data frame (symbolic_tbl) with 7 observations and 4 variables:
Age: Age range (years, interval).
Cholesterol: Cholesterol level range (mg/dL, interval).
Weight: Weight range (pounds, interval).
n: Number of individuals in the age group (numeric).
| Sample size (n) | 7 |
| Variables (p) | 4 |
| Subject area | Medical |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Regression |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(age_cholesterol_weight.int)
Histogram-valued dataset of 229 countries with 3 population age pyramid histograms (both sexes, male, female). Each histogram has 21 age bins representing the distribution of the population across age groups.
data(age_pyramids.hist)
A data frame with 229 observations (countries) and 3 histogram-valued variables:
Both.Sexes.Population: Histogram of total population
by age group.
Male.Population: Histogram of male population by age group.
Female.Population: Histogram of female population by age group.
Row names are country names (e.g., WORLD, Afghanistan, Albania).
| Sample size (n) | 229 |
| Variables (p) | 3 |
| Subject area | Demographics |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Descriptive statistics |
HistDAWass R package (Age_Pyramids_2014 dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (Age_Pyramids_2014).
data(age_pyramids.hist)
Aggregate tabular numerical data (n by p) into interval-valued or histogram-valued symbolic data (K by p) based on a grouping mechanism.
aggregate_to_symbolic(x, type = "int", group_by = "kmeans",
                      stratify_var = NULL, K = 5, interval = "range",
                      quantile_probs = c(0.05, 0.95), bins = 10, nK = NULL)
| x | A data.frame with n rows and p columns. May contain non-numeric columns used for grouping or stratification; only numeric columns are aggregated. |
| type | Output symbolic type: "int" (interval-valued) or "hist" (histogram-valued). |
| group_by | Grouping mechanism, such as "kmeans", "hclust", "resampling", or the name of a categorical column (e.g., "Species"). |
| stratify_var | Optional column name or index for a stratification variable. When provided, grouping and aggregation are performed independently within each level. Default is NULL. |
| K | Number of groups for clustering (for example group_by = "kmeans" or "hclust"). Default is 5. |
| interval | Interval construction method when type = "int": "range" (min/max) or "quantile". |
| quantile_probs | Numeric vector of length 2 giving the lower and upper quantile probabilities for interval = "quantile". |
| bins | Number of histogram bins when type = "hist". |
| nK | Number of observations to sample per group when group_by = "resampling". |
The function aggregates classical tabular data into symbolic data in two steps:
1. Partition observations into groups via group_by (clustering, resampling, or a categorical variable).
2. Within each group, summarize each numeric variable as an interval (min/max or quantiles) or a histogram.
When stratify_var is provided, grouping and aggregation are performed
within each level of the stratification variable. Label values are prefixed
by the stratum name (e.g., "setosa.cluster_1").
For type = "hist", bin boundaries are computed from the global data
range to ensure comparability across groups.
Non-numeric columns (other than those used for grouping or stratification) are silently excluded from aggregation.
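The point about computing bin boundaries from the global data range can be illustrated with a short base-R sketch. It is not the package's implementation: hist_by_group is a hypothetical helper, and equal-width bins are an assumption made here for simplicity.

```r
# Hypothetical helper: per-group histogram counts on shared (global) breaks.
# Equal-width bins are an assumption of this sketch.
hist_by_group <- function(x, g, bins = 5) {
  # Breaks span the range of ALL observations, not of each group separately.
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  counts <- lapply(split(x, g), function(v) {
    hist(v, breaks = breaks, plot = FALSE, include.lowest = TRUE)$counts
  })
  list(breaks = breaks, counts = do.call(rbind, counts))
}

res <- hist_by_group(iris$Sepal.Length, iris$Species, bins = 5)
res$breaks   # one set of boundaries shared by every group
res$counts   # one row per group, one column per bin
```

Because every group is binned on the same breaks, the rows of the count matrix can be compared bin by bin, which would not hold if each group used its own range.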
For type = "int": a symbolic_tbl (RSDA format) with
a label column followed by symbolic_interval columns for each
numeric variable (K rows, 1 + p columns).
For type = "hist": a MatH object
(K rows by p columns of histogram-valued data).
# Group by a categorical variable -> interval data
res1 <- aggregate_to_symbolic(iris, type = "int", group_by = "Species")
res1

# K-means clustering -> interval data
res2 <- aggregate_to_symbolic(iris[, 1:4], type = "int", group_by = "kmeans", K = 3)

# Quantile-based intervals
res3 <- aggregate_to_symbolic(iris[, 1:4], type = "int", group_by = "kmeans", K = 3,
                              interval = "quantile", quantile_probs = c(0.1, 0.9))

# Resampling -> interval data
set.seed(42)
res4 <- aggregate_to_symbolic(iris[, 1:4], type = "int", group_by = "resampling",
                              K = 5, nK = 30)

# Histogram aggregation
res5 <- aggregate_to_symbolic(iris, type = "hist", group_by = "Species", bins = 5)

# Hierarchical clustering -> interval data
res6 <- aggregate_to_symbolic(iris[, 1:4], type = "int", group_by = "hclust", K = 3)

# Stratified aggregation
res7 <- aggregate_to_symbolic(iris, type = "int", group_by = "kmeans", K = 2,
                              stratify_var = "Species")
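For readers who want to see what the interval summary amounts to, the following base-R sketch computes min/max or quantile bounds per group. It is a simplified stand-in, not the package's code: intervals_by_group is a hypothetical helper, and it returns a plain data frame with _min/_max columns rather than a symbolic_tbl.

```r
# Hypothetical helper: summarise numeric columns as [lower, upper] per group.
# probs = NULL gives min/max; otherwise a pair of quantile probabilities.
intervals_by_group <- function(x, group, probs = NULL) {
  is_num <- vapply(x, is.numeric, logical(1))
  num <- x[, is_num & names(x) != group, drop = FALSE]
  bounds <- function(v) {
    if (is.null(probs)) range(v) else unname(quantile(v, probs))
  }
  out <- lapply(split(num, x[[group]]), function(d) {
    b <- vapply(d, bounds, numeric(2))  # 2 x p: row 1 lower, row 2 upper
    setNames(as.vector(b),
             paste0(rep(colnames(b), each = 2), c("_min", "_max")))
  })
  as.data.frame(do.call(rbind, out))
}

rng <- intervals_by_group(iris, "Species")
qnt <- intervals_by_group(iris, "Species", probs = c(0.05, 0.95))
rng[, c("Sepal.Length_min", "Sepal.Length_max")]
```

Quantile-based bounds are never wider than the min/max bounds for the same group, which is why they are the usual choice when a few extreme observations would otherwise stretch the intervals.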
Histogram-valued dataset of 16 airlines flying into JFK Airport.
Six variables (Flight Time, Taxi In, Arrival Delay, Taxi Out,
Departure Delay, Weather Delay) recorded as frequency distributions.
This is the wide (flat table) format; see airline_flights2.modal
for the modal-valued version.
data(airline_flights.hist)
A data frame with 16 observations (Airline1–Airline16) and 17 numeric columns representing 6 histogram variables in wide format:
Flight Time(<120), Flight Time([120, 220]),
Flight Time(>220): Flight time distribution (3 bins).
Taxi In(<4), Taxi In([4, 10]),
Taxi In(>10): Taxi-in time distribution (3 bins).
Arrival Delay(<0), Arrival Delay([0, 60]),
Arrival Delay(>60): Arrival delay distribution (3 bins).
Taxi Out(<16), Taxi Out([16, 30]),
Taxi Out(>30): Taxi-out time distribution (3 bins).
Departure Delay(<0), Departure Delay([0, 60]),
Departure Delay(>60): Departure delay distribution (3 bins).
Weather Delay(No), Weather Delay(Yes):
Weather delay distribution (2 bins).
| Sample size (n) | 16 |
| Variables (p) | 17 |
| Subject area | Transportation |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.7.
data(airline_flights.hist)
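In the wide layout, each histogram variable is spread over several columns named Variable(bin). A base-R sketch of regrouping such columns by variable is shown below. The data are invented relative frequencies for two airlines and two variables only; whether the packaged dataset stores counts or relative frequencies is not stated here, so that part is an assumption.

```r
# Made-up wide-format table using the Variable(bin) naming scheme.
wide <- data.frame(
  "Flight Time(<120)"       = c(0.2, 0.5),
  "Flight Time([120, 220])" = c(0.5, 0.3),
  "Flight Time(>220)"       = c(0.3, 0.2),
  "Weather Delay(No)"       = c(0.9, 0.8),
  "Weather Delay(Yes)"      = c(0.1, 0.2),
  check.names = FALSE,
  row.names = c("Airline1", "Airline2")
)

# The variable name is everything before the first "(".
variable <- sub("\\(.*$", "", names(wide))

# One matrix per histogram variable, with columns labelled by bin only.
by_var <- lapply(split(names(wide), variable), function(cols) {
  m <- as.matrix(wide[cols])
  colnames(m) <- sub("^[^(]*\\((.*)\\)$", "\\1", cols)
  m
})
by_var[["Flight Time"]]
```

Each element of by_var is then an n x (number of bins) matrix for a single variable, which is often easier to plot or normalise than the flat table.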
Modal-valued version of the airline flights dataset.
See airline_flights.hist for the wide-format version.
data(airline_flights2.modal)
A symbolic data frame (symbolic_tbl) with 16 observations and
6 modal-valued variables:
FlightTime: Modal distribution over flight time bins.
TaxiIn: Modal distribution over taxi-in time bins.
ArrivalDelay: Modal distribution over arrival delay bins.
TaxiOut: Modal distribution over taxi-out time bins.
DepartureDelay: Modal distribution over departure delay bins.
WeatherDelay: Modal distribution over weather delay bins.
| Sample size (n) | 16 |
| Variables (p) | 6 |
| Subject area | Transportation |
| Symbolic format | Modal |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.7.
data(airline_flights2.modal)
Convert a 3-dimensional array [n, p, 2] to iGAP format
(data.frame with comma-separated interval values).
ARRAY_to_iGAP(data)
| data | A numeric array of dimension [n, p, 2]. |
A data.frame in iGAP format with comma-separated "min,max"
values.
x <- array(NA, dim = c(4, 3, 2))
x[,,1] <- matrix(c(1,2,3,4, 5,6,7,8, 9,10,11,12), nrow = 4)
x[,,2] <- matrix(c(3,5,6,7, 8,9,10,12, 13,15,16,18), nrow = 4)
dimnames(x) <- list(paste0("obs_", 1:4), c("V1","V2","V3"), c("min","max"))
igap <- ARRAY_to_iGAP(x)
igap
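The idea behind this conversion can be sketched in base R without the package. The code below is an illustration only: igap_like is a hypothetical name, and the exact string formatting produced by ARRAY_to_iGAP (spacing, rounding) may differ.

```r
# Small [n, p, 2] array: slice "min" holds lower bounds, "max" upper bounds.
x <- array(NA_real_, dim = c(2, 2, 2),
           dimnames = list(c("obs_1", "obs_2"), c("V1", "V2"), c("min", "max")))
x[, , "min"] <- matrix(c(1, 2, 5, 6), nrow = 2)
x[, , "max"] <- matrix(c(3, 5, 8, 9), nrow = 2)

# paste() works element-wise over the two n x p slices (column-major order),
# so reshaping with the original row count restores the n x p layout.
cells <- paste(x[, , "min"], x[, , "max"], sep = ",")
igap_like <- as.data.frame(
  matrix(cells, nrow = dim(x)[1], dimnames = dimnames(x)[1:2]),
  stringsAsFactors = FALSE
)
igap_like
```

Each cell of igap_like is a single character string holding both bounds, so the table has p columns instead of 2p.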
Convert a 3-dimensional array [n, p, 2] to MM format
(data.frame with paired _min/_max columns).
ARRAY_to_MM(data)
| data | A numeric array of dimension [n, p, 2]. |
A data.frame with 2p columns (paired _min/_max).
x <- array(NA, dim = c(4, 3, 2))
x[,,1] <- matrix(c(1,2,3,4, 5,6,7,8, 9,10,11,12), nrow = 4)
x[,,2] <- matrix(c(3,5,6,7, 8,9,10,12, 13,15,16,18), nrow = 4)
dimnames(x) <- list(paste0("obs_", 1:4), c("V1","V2","V3"), c("min","max"))
mm <- ARRAY_to_MM(x)
mm
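As a rough picture of the paired-column layout, the base-R sketch below interleaves the two slices of a small array. It is not the package's implementation; mm_like is a hypothetical name, and the column order (min before max, variable by variable) is an assumption based on the _min/_max description.

```r
x <- array(NA_real_, dim = c(2, 2, 2),
           dimnames = list(c("obs_1", "obs_2"), c("V1", "V2"), c("min", "max")))
x[, , "min"] <- matrix(c(1, 2, 5, 6), nrow = 2)
x[, , "max"] <- matrix(c(3, 5, 8, 9), nrow = 2)

# For each variable, pull its n x 2 (min, max) block and name the pair.
mm_like <- do.call(cbind, lapply(dimnames(x)[[2]], function(v) {
  matrix(x[, v, ], nrow = dim(x)[1],
         dimnames = list(dimnames(x)[[1]], paste0(v, c("_min", "_max"))))
}))
mm_like <- as.data.frame(mm_like)
mm_like
```

The result has 2p numeric columns, so ordinary numeric tools (summary, subsetting, arithmetic on bounds) apply directly.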
Convert a 3-dimensional array [n, p, 2] to RSDA format
(symbolic_tbl with symbolic_interval columns).
ARRAY_to_RSDA(data)
| data | A numeric array of dimension [n, p, 2]. |
A symbolic_tbl with p symbolic_interval columns.
x <- array(NA, dim = c(4, 3, 2))
x[,,1] <- matrix(c(1,2,3,4, 5,6,7,8, 9,10,11,12), nrow = 4)
x[,,2] <- matrix(c(3,5,6,7, 8,9,10,12, 13,15,16,18), nrow = 4)
dimnames(x) <- list(paste0("obs_", 1:4), c("V1","V2","V3"), c("min","max"))
rsda <- ARRAY_to_RSDA(x)
rsda
Symbolic dataset of autoregressive time series models for 4 banks. Each bank is described by AR model order, parameters, and whether parameters are known.
data(bank_rates)
A data frame with 4 observations (Bank1–Bank4) and 6 variables:
bank: Bank identifier (character).
order: AR model order (numeric).
phi1: First AR parameter (numeric; NA if unknown).
phi2: Second AR parameter (numeric; NA if order < 2 or unknown).
phi1_known: Whether phi1 is known (logical).
phi2_known: Whether phi2 is known (logical; NA if order < 2).
| Sample size (n) | 4 |
| Variables (p) | 6 |
| Subject area | Finance |
| Symbolic format | Symbolic (model-valued) |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.9.
data(bank_rates)
Interval-valued data for 19 baseball teams with aggregated player batting statistics and a pattern variable classifying team performance.
data(baseball.int)
A symbolic data frame (symbolic_tbl) with 19 observations and 3 variables:
At_Bats: Range of at-bats across players (interval).
Hits: Range of hits across players (interval).
Pattern: Team performance pattern code (character).
| Sample size (n) | 19 |
| Variables (p) | 3 |
| Subject area | Sports |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(baseball.int)
Interval-valued data for 21 bat species described by 4 morphological measurements. Benchmark dataset for matrix visualization.
data(bats.int)
A data frame with 21 observations and 9 columns (4 interval variables
in _l/_u Min-Max pairs, plus a label):
species: Bat species name (character).
head_l, head_u: Head length range (mm).
tail_l, tail_u: Tail length range (mm).
height_l, height_u: Ear height range (cm).
forearm_l, forearm_u: Forearm length range (mm).
Used to demonstrate color coding schemes, the HCT-R2E seriation algorithm, and distance measure comparisons (Gowda-Diday, Hausdorff, City-Block, L1, L2, etc.) for interval data.
| Sample size (n) | 21 |
| Variables (p) | 9 |
| Subject area | Zoology |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Visualization |
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
data(bats.int)
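Data stored as _l/_u pairs can be turned into the centre and half-range representation that many interval distance measures work with. The sketch below uses made-up values in the same column layout (not the actual bats.int measurements), and the _c/_r suffixes are names chosen here for illustration.

```r
# Made-up values in the _l/_u layout; not taken from bats.int.
d <- data.frame(
  species   = c("sp_A", "sp_B"),
  head_l    = c(33, 38), head_u    = c(52, 50),
  forearm_l = c(27, 34), forearm_u = c(32, 41)
)

# Locate lower-bound columns and derive the matching upper-bound names.
lower <- grep("_l$", names(d), value = TRUE)
upper <- sub("_l$", "_u", lower)
stopifnot(all(upper %in% names(d)))  # every lower bound needs its partner

centre <- (d[upper] + d[lower]) / 2  # interval midpoints
half   <- (d[upper] - d[lower]) / 2  # interval half-ranges
names(centre) <- sub("_l$", "_c", lower)
names(half)   <- sub("_l$", "_r", lower)

res <- cbind(d["species"], centre, half)
res
```

The pair (centre, half-range) carries the same information as (lower, upper), so the original bounds can be recovered as centre minus or plus the half-range.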
Mixed symbolic dataset of 20 bird observations with histogram-valued feather density and body size, categorical tone, and distribution-valued shade (fuzzy taxonomy). From Tables 6.9 and 6.14 of Billard and Diday (2007).
data(bird_color_taxonomy.hist)
A data frame with 20 observations and 4 variables:
density: Histogram-valued feather density (up to 4 bins).
size: Histogram-valued body size (2-bin).
tone: Categorical tone (dark/light).
shade: Distribution-valued shade (purple/red/white/yellow
with fuzzy weights).
| Sample size (n) | 20 |
| Variables (p) | 4 |
| Subject area | Zoology |
| Symbolic format | Mixed (histogram, categorical, distribution) |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2007), Tables 6.9/6.14.
Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Tables 6.9 and 6.14.
data(bird_color_taxonomy.hist)
Three bird species (Geese, Ostrich, Penguin) with interval-valued height, distribution-valued color, and categorical flying/migratory variables.
data(bird_species_extended.mix)
A data frame with 3 observations and 6 variables:
species: Species name (character).
flying: Flying ability (Yes/No, character).
height_l: Height lower bound (cm, numeric).
height_u: Height upper bound (cm, numeric).
color: Color distribution as weighted set string
(e.g., "{white, 0.3; black, 0.7}").
migratory: Migratory behavior (Yes/No, character).
| Sample size (n) | 3 |
| Variables (p) | 6 |
| Subject area | Zoology |
| Symbolic format | Mixed (interval, categorical, distribution) |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.19.
data(bird_species_extended.mix)
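The weighted set strings in the color column can be unpacked into named numeric vectors with a few lines of base R. This is a sketch that assumes every entry follows the "{label, weight; label, weight}" pattern shown in the format description; parse_weighted_set is a hypothetical helper, not part of this package.

```r
# Hypothetical helper: "{white, 0.3; black, 0.7}" -> c(white = 0.3, black = 0.7)
parse_weighted_set <- function(s) {
  body  <- gsub("^\\s*\\{|\\}\\s*$", "", s)        # drop the surrounding braces
  items <- strsplit(body, ";", fixed = TRUE)[[1]]  # one "label, weight" per item
  kv    <- strsplit(items, ",", fixed = TRUE)
  w     <- vapply(kv, function(p) as.numeric(trimws(p[2])), numeric(1))
  names(w) <- vapply(kv, function(p) trimws(p[1]), character(1))
  w
}

w <- parse_weighted_set("{white, 0.3; black, 0.7}")
w
sum(w)  # weights of a distribution-valued cell are expected to sum to 1
```

Applying the helper with lapply over the color column would give one named weight vector per species.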
Symbolic data for 3 bird species (Swallow, Ostrich, Penguin) with interval-valued size, categorical flying, and categorical migration. Foundational SDA example from 600 individual bird observations.
data(bird_species.mix)
A data frame with 3 observations (Swallow, Ostrich, Penguin) and 5 variables:
species: Species name (character).
flying: Flying ability (Yes/No, character).
size_l, size_u: Size range (cm, Min-Max pair).
migration: Migratory behavior (TRUE/FALSE, logical).
| Sample size (n) | 3 |
| Variables (p) | 5 |
| Subject area | Zoology |
| Symbolic format | Mixed (interval, categorical) |
| Analytical tasks | Descriptive statistics |
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.2, p.6.
data(bird_species.mix)
Interval-valued morphological measurements for 20 bird specimens.
Despite the .mix suffix, this dataset contains only
interval-valued variables (density and size).
data(bird.mix)
A symbolic data frame (symbolic_tbl) with 20 observations and 2 variables:
Density: Feather density range (interval).
Size: Body size range (cm, interval).
| Sample size (n) | 20 |
| Variables (p) | 2 |
| Subject area | Zoology |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.5.
data(bird.mix)
Interval-valued blood pressure and pulse rate measurements for 15 patient groups.
data(blood_pressure.int)
A symbolic data frame (symbolic_tbl) with 15 observations and
3 interval-valued variables:
Pulse_Rate: Pulse rate range (beats per minute, interval).
Systolic_Pressure: Systolic blood pressure range (mmHg, interval).
Diastolic_Pressure: Diastolic blood pressure range (mmHg, interval).
| Sample size (n) | 15 |
| Variables (p) | 3 |
| Subject area | Medical |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Regression |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(blood_pressure.int)
Histogram-valued blood test results for 14 gender-age groups (e.g., Female-20, Male-50). Each observation contains histograms for cholesterol, hemoglobin, and hematocrit, represented as multi-bin distributions.
data(blood.hist)
A data frame with 14 observations and 3 histogram-valued variables:
Cholesterol: Histogram of cholesterol levels (mg/dL).
Hemoglobin: Histogram of hemoglobin levels (g/dL).
Hematocrit: Histogram of hematocrit levels (%).
| Sample size (n) | 14 |
| Variables (p) | 3 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics, Clustering |
HistDAWass R package (BLOOD dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (BLOOD dataset).
data(blood.hist)
Interval-valued specifications for 33 Italian car models, classified into 4 categories (Utilitaria, Berlina, Ammiraglia, Sportiva). An extended version of the classic cars interval dataset with 8 interval-valued variables including dimensions.
data(car_models.int)
A data frame with 33 observations and 9 variables:
price: Price range (currency units).
engine_cc: Engine displacement range (cc).
top_speed: Top speed range (km/h).
acceleration: Acceleration range (seconds 0-100 km/h).
wheelbase: Wheelbase range (cm).
length: Length range (cm).
width: Width range (cm).
height: Height range (cm).
class: Car category (Utilitaria, Berlina, Ammiraglia, Sportiva).
| Sample size (n) | 33 |
| Variables (p) | 9 |
| Subject area | Automotive |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Classification |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(car_models.int)
Interval-valued data for 8 car brands with price and performance specifications. Each brand aggregates multiple models into interval ranges.
data(car.int)
A symbolic data frame (symbolic_tbl) with 8 observations and 5 variables:
Car: Car brand name (character).
Price: Price range (thousands of currency units, interval).
Max_Velocity: Maximum velocity range (km/h, interval).
Accn_Time: Acceleration time range (seconds 0–100 km/h, interval).
Cylinder_Capacity: Engine cylinder capacity range (cc, interval).
| Sample size (n) | 8 |
| Variables (p) | 5 |
| Subject area | Automotive |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(car.int)
Interval-valued data from cardiological examinations of 44 patients. Each patient is described by 5 interval-valued physiological measurements.
data(cardiological.int)
A data frame with 44 observations and 5 interval-valued variables:
pulse: Pulse rate range (beats per minute).
systolic: Systolic blood pressure range (mmHg).
diastolic: Diastolic blood pressure range (mmHg).
arterial1: First arterial measurement range.
arterial2: Second arterial measurement range.
| Sample size (n) | 44 |
| Variables (p) | 5 |
| Subject area | Medical |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Clustering |
Extracted from RSDA package (cardiologicalv2).
Rodriguez, O. (2000). Classification et modeles lineaires en analyse des donnees symboliques. Doctoral Thesis, Universite Paris IX-Dauphine.
data(cardiological.int)
Interval-valued data for 27 car models classified into four classes (Utilitarian, Berlina, Sportive, Luxury), described by price, engine capacity, top speed, and acceleration intervals.
data(cars.int)
A symbolic data frame (symbolic_tbl) with 27 observations and 5 variables:
Price: Price range (interval).
EngCap: Engine capacity range (cc, interval).
TopSpeed: Top speed range (km/h, interval).
Acceleration: Acceleration range (seconds 0–100 km/h, interval).
class: Car class (Utilitarian, Berlina, Sportive, Luxury; set-valued).
| Sample size (n) | 27 |
| Variables (p) | 5 |
| Subject area | Automotive |
| Symbolic format | Interval |
| Analytical tasks | Classification |
https://CRAN.R-project.org/package=MAINT.Data
Duarte Silva, A.P., Brito, P., Filzmoser, P. and Dias, J.G. (2021). MAINT.Data: Modelling and Analysing Interval Data in R. R Journal, 13(2).
data(cars.int)
Mixed symbolic dataset of 10 census regions combining 6 variables of 4 symbolic types: histograms (age, home value), distributions (gender, tenure), a multi-valued set (fuel), and an interval (income).
data(census.mix)
A symbolic data frame (symbolic_tbl) with 10 observations
(regions) and 6 variables:
age: Histogram-valued age distribution (12 age bins).
home_value: Histogram-valued home value distribution
(7 value bins, in $1000s).
gender: Distribution over gender (male, female).
fuel: Multi-valued set of fuel types used.
tenure: Distribution over housing tenure
(owner, renter, vacant).
income: Interval-valued household income range ($1000s).
Row names are Region_1 through Region_10.
| Sample size (n) | 10 |
| Variables (p) | 6 |
| Subject area | Demographics |
| Symbolic format | Mixed (interval, histogram, distribution, multi-valued) |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 7-23.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 7-23.
data(census.mix)
Histogram-valued monthly climate data for 60 Chinese weather stations. Each station has 14 climate variables measured across 12 months (168 histogram columns total). Histograms are reduced to 10 decile bins from the original HistDAWass distributions.
data(china_climate_month.hist)
A data frame with 60 observations (stations) and 168
histogram-valued variables. Variables follow the pattern
variable_Month (e.g., mean.temp_Jan). The 14 climate
variables are: mean pressure, mean temperature, mean max/min
temperature, total precipitation, sunshine duration, mean cloud amount,
mean relative humidity, snow days, dominant wind direction, mean wind
speed, dominant wind frequency, extreme max/min temperature.
| Sample size (n) | 60 |
| Variables (p) | 168 |
| Subject area | Climate |
| Symbolic format | Histogram |
| Analytical tasks | Clustering |
HistDAWass R package (China_Month dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (China_Month dataset).
data(china_climate_month.hist)
Histogram-valued seasonal climate data for 60 Chinese weather stations. Each station has 14 climate variables measured across 4 seasons (56 histogram columns total). Histograms are reduced to 10 decile bins from the original HistDAWass distributions.
data(china_climate_season.hist)
A data frame with 60 observations (stations) and 56
histogram-valued variables. Variables follow the pattern
variable_Season (e.g., mean.temp_Spring). The 14 climate
variables are: mean pressure, mean temperature, mean max/min
temperature, total precipitation, sunshine duration, mean cloud amount,
mean relative humidity, snow days, dominant wind direction, mean wind
speed, dominant wind frequency, extreme max/min temperature.
| Sample size (n) | 60 |
| Variables (p) | 56 |
| Subject area | Climate |
| Symbolic format | Histogram |
| Analytical tasks | Clustering |
HistDAWass R package (China_Seas dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (China_Seas dataset).
data(china_climate_season.hist)
Interval-valued dataset of monthly temperature ranges for 15 weather stations in China. Each station has 12 monthly temperature intervals (minimum and maximum observed temperatures in degrees Celsius) and an elevation value in meters.
data(china_temp_monthly.int)
A symbolic data frame (symbolic_tbl) with 15 observations
(weather stations) and 13 variables:
January, February, March, April,
May, June, July, August,
September, October, November, December:
Interval-valued monthly temperature ranges (degrees Celsius).
Elevation: Station elevation above sea level (numeric, meters).
Row names are station names (e.g., BoKeTu, Hailaer, LaSa).
| Sample size (n) | 15 |
| Variables (p) | 13 |
| Subject area | Climate |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 7-9.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 7-9.
data(china_temp_monthly.int)
Interval-valued temperature data (Celsius) for 60 Chinese meteorological stations observed over the four quarters of years 1974 to 1988. One outlier observation (YinChuan_1982) has been discarded.
data(china_temp.int)
A symbolic data frame (symbolic_tbl) with 899 observations and 5 variables:
Q1: Quarter 1 (Jan–Mar) temperature range (tenths of degrees Celsius, interval).
Q2: Quarter 2 (Apr–Jun) temperature range (interval).
Q3: Quarter 3 (Jul–Sep) temperature range (interval).
Q4: Quarter 4 (Oct–Dec) temperature range (interval).
GeoReg: Geographic region classification (factor).
Originates from the Long-Term Instrumental Climatic Database of the People's Republic of China. Widely used in the SDA literature for demonstrating standardization, clustering, self-organizing maps, MLE and MANOVA.
| Sample size (n) | 899 |
| Variables (p) | 5 |
| Subject area | Climate |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://CRAN.R-project.org/package=MAINT.Data
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. J. Appl. Stat., 39(1), 3-20.
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
data(china_temp.int)
Histogram-valued cholesterol distributions for 14 gender-age groups (7 female + 7 male age groups from 20s to 80+). Each observation has a 10-bin histogram of cholesterol levels.
data(cholesterol.hist)
A data frame with 14 observations and 3 variables:
gender: Gender (Female or Male).
age: Age group (20s, 30s, 40s, 50s, 60s, 70s, 80+).
cholesterol: Histogram-valued cholesterol distribution.
| Sample size (n) | 14 |
| Variables (p) | 3 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 4.5.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 4.5.
data(cholesterol.hist)
Clean up variable names so that they conform to the RSDA format.
clean_colnames(data)
| data | The conventional data. |
Data after cleaning variable names.
data(mushroom.int.mm)
mushroom.clean <- clean_colnames(data = mushroom.int.mm)
Histogram-valued dataset of 12 counties with gender-stratified income histograms and sample sizes. Each county has a male income histogram, a female income histogram, and the number of respondents in each group.
data(county_income_gender.hist)
A data frame with 12 observations (counties) and 4 variables:
male_income: Histogram of male household income
(4 bins from $0 to $100k).
female_income: Histogram of female household income
(4 bins from $0 to $100k).
n_males: Number of male respondents (numeric).
n_females: Number of female respondents (numeric).
Row names are County_1 through County_12.
| Sample size (n) | 12 |
| Variables (p) | 4 |
| Subject area | Economics |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2020), Table 6-16.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 6-16.
data(county_income_gender.hist)
Histogram-valued dataset of 7 forest cover types with 4 topographic histogram variables. Each histogram describes the distribution of a terrain feature across locations classified as that cover type.
data(cover_types.hist)
A data frame with 7 observations (cover types) and 4 histogram-valued variables:
elevation: Histogram of elevation values (meters).
distance_to_water: Histogram of horizontal distance
to nearest water source (meters).
hillshade: Histogram of hillshade index values.
slope: Histogram of slope values (degrees).
Row names are CoverType_1 through CoverType_7.
| Sample size (n) | 7 |
| Variables (p) | 4 |
| Subject area | Forestry |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Classification |
Billard, L. and Diday, E. (2020), Table 7-21.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 7-21.
data(cover_types.hist)
Interval-valued credit card spending aggregated by person-month. Covers the monthly expenditures of three individuals (Jon, Tom, Leigh) across five categories.
data(credit_card.int)
A data frame with 6 observations and 11 columns (5 interval variables
in _l/_u Min-Max pairs, plus a label):
person_month: Person-month identifier (e.g., "Jon - January"; character).
food_l, food_u: Food expenditure range (USD).
social_l, social_u: Social expenditure range (USD).
travel_l, travel_u: Travel expenditure range (USD).
gas_l, gas_u: Gas expenditure range (USD).
clothes_l, clothes_u: Clothes expenditure range (USD).
The original classical dataset (Table 2.3) records individual transactions. The symbolic version (Table 2.4) aggregates into interval-valued observations for each person-month combination.
| Sample size (n) | 6 |
| Variables (p) | 11 |
| Subject area | Finance |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Tables 2.3-2.4.
data(credit_card.int)
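The aggregation from individual transactions to person-month intervals can be sketched in base R as follows. The transaction records are invented for illustration (they are not the values of Table 2.3), only two of the five categories are shown, and the package may build the dataset differently.

```r
# Hypothetical transaction-level records (illustrative values only)
tx <- data.frame(
  person = c("Jon", "Jon", "Jon", "Tom", "Tom"),
  month  = c("January", "January", "February", "January", "January"),
  food   = c(20, 35, 18, 40, 55),
  gas    = c(15, 25, 30, 22, 28)
)
tx$person_month <- paste(tx$person, tx$month, sep = " - ")

# Lower and upper bounds of each category per person-month
lower <- aggregate(cbind(food, gas) ~ person_month, data = tx, FUN = min)
upper <- aggregate(cbind(food, gas) ~ person_month, data = tx, FUN = max)

# Combine into the _l/_u Min-Max layout used by credit_card.int
intervals <- data.frame(
  person_month = lower$person_month,
  food_l = lower$food, food_u = upper$food,
  gas_l  = lower$gas,  gas_u  = upper$gas
)
```

Each row of intervals then describes one person-month by the range of its transactions rather than by the transactions themselves.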
Modal-valued dataset of 15 gangs described by probability distributions
over crime type, gender, and age group. This is the wide (flat table)
format; see crime2.modal for the modal-valued version.
data(crime.modal)
A data frame with 15 observations (gang1–gang15) and 7 numeric columns representing 3 modal variables in wide format:
Crime(violent), Crime(non-violent), Crime(none):
Distribution over crime types (3 bins).
Gender(male), Gender(female):
Distribution over gender (2 bins).
Age(<20), Age(>=20):
Distribution over age groups (2 bins).
| Sample size (n) | 15 |
| Variables (p) | 7 |
| Subject area | Criminology |
| Symbolic format | Modal |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(crime.modal)
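Because each modal variable is spread over several numeric columns in the wide format, a quick sanity check is that the columns of one variable sum to 1 in every row. The sketch below does this on a made-up two-gang excerpt that only mimics the documented column names; the probabilities are not taken from the dataset.

```r
# Hypothetical two-gang excerpt in the wide layout (values are made up)
gangs <- data.frame(
  "Crime(violent)" = c(0.5, 0.2), "Crime(non-violent)" = c(0.3, 0.5),
  "Crime(none)" = c(0.2, 0.3),
  "Gender(male)" = c(0.8, 0.6), "Gender(female)" = c(0.2, 0.4),
  "Age(<20)" = c(0.7, 0.4), "Age(>=20)" = c(0.3, 0.6),
  row.names = c("gang1", "gang2"), check.names = FALSE
)

# The modal variable name is the text before the opening parenthesis
variable <- sub("\\(.*$", "", names(gangs))

# Row sums within each modal variable; every entry should equal 1
sums <- sapply(split(names(gangs), variable),
               function(cols) rowSums(gangs[, cols, drop = FALSE]))
```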
Modal-valued version of the crime demographics dataset.
See crime.modal for the wide-format version.
data(crime2.modal)
A symbolic data frame (symbolic_tbl) with 15 observations and
3 modal-valued variables:
Crime: Modal distribution over crime types
(violent, non-violent, none).
Gender: Modal distribution over gender (male, female).
Age: Modal distribution over age groups (<20, >=20).
| Sample size (n) | 15 |
| Variables (p) | 3 |
| Subject area | Criminology |
| Symbolic format | Modal |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(crime2.modal)
Daily high and low prices of WTI (West Texas Intermediate) crude oil futures from January 2, 2003 to December 30, 2011 (2261 trading days). This dataset matches the period used by Yang, Han, Hong and Wang (2016) for analyzing crisis impacts on crude oil prices using interval time series modelling.
data(crude_oil_wti.its)
A data frame with 2261 observations and 3 variables:
date: Trading date (Date class).
low: Daily low price (USD per barrel).
high: Daily high price (USD per barrel).
WTI crude oil is a benchmark for oil prices in the Americas. This dataset covers a period that includes the 2003 Iraq War, the 2007–2008 oil price spike (reaching nearly USD 150/barrel), the 2008 global financial crisis, and the subsequent recovery. The wide variation in price levels and volatility regimes makes this dataset ideal for evaluating interval time series models under structural breaks.
| Sample size (n) | 2261 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance / Commodities |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Structural break analysis |
Yahoo Finance, ticker CL=F. Downloaded via the
quantmod package.
Yang, W., Han, A., Hong, Y. and Wang, S. (2016). Analysis of crisis impact on crude oil prices: A new approach with interval time series modelling. Quantitative Finance, 16(12), 1917–1928.
data(crude_oil_wti.its)
head(crude_oil_wti.its)
plot(crude_oil_wti.its$date, crude_oil_wti.its$high, type = "l", col = "red", ylab = "Price (USD/barrel)", xlab = "Date", main = "WTI Crude Oil Daily High/Low (2003-2011)")
lines(crude_oil_wti.its$date, crude_oil_wti.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
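Interval time series stored as low/high pairs are often re-expressed as a center and a half-range before modelling. The sketch below shows that transformation on a small made-up data frame that follows the documented date/low/high layout; the prices are illustrative, and the column names center and half_range are assumptions of this example rather than part of the dataset.

```r
# Hypothetical interval time series in the documented layout (made-up prices)
its <- data.frame(
  date = as.Date(c("2003-01-02", "2003-01-03", "2003-01-06")),
  low  = c(30.0, 31.0, 32.0),
  high = c(32.0, 34.0, 33.0)
)

# Every interval must be well formed before it is transformed
stopifnot(all(its$low <= its$high))

# Center and half-range representation of each daily interval
its$center     <- (its$low + its$high) / 2
its$half_range <- (its$high - its$low) / 2
```

The original bounds can be recovered as center - half_range and center + half_range, so no information is lost.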
Daily high and low prices of the Dow Jones Industrial Average (DJIA) from January 2, 2004 to December 30, 2005 (504 trading days). This dataset matches the period used in the foundational interval time series work by Arroyo, Gonzalez-Rivera and Mate (2011).
data(djia.its)
A data frame with 504 observations and 3 variables:
date: Trading date (Date class).
low: Daily low price of the DJIA.
high: Daily high price of the DJIA.
The DJIA is a price-weighted index of 30 prominent companies listed on stock exchanges in the United States. Each observation represents a trading day with the daily low and high prices forming an interval. This dataset has been used alongside the S&P 500 to compare interval forecasting methods.
| Sample size (n) | 504 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker ^DJI. Downloaded via the
quantmod package.
Arroyo, J., Gonzalez-Rivera, G. and Mate, C. (2011). Forecasting with interval and histogram data: Some financial applications. In Handbook of Empirical Economics and Finance, pp. 247–280. Chapman and Hall/CRC.
data(djia.its)
head(djia.its)
plot(djia.its$date, djia.its$high, type = "l", col = "red", ylab = "Price", xlab = "Date", main = "DJIA Daily High/Low")
lines(djia.its$date, djia.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Interval-valued dataset of 9 E. coli transport routes with 5 interval variables representing biochemical pathway measurements.
data(ecoli_routes.int)
A symbolic data frame (symbolic_tbl) with 9 observations
(transport routes) and 5 interval-valued variables:
Y1 through Y5: Interval-valued biochemical
pathway measurements.
Row names are Route_1 through Route_9.
| Sample size (n) | 9 |
| Variables (p) | 5 |
| Subject area | Biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 8-10.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 8-10.
data(ecoli_routes.int)
Interval-valued proportions for 12 sex-age population groups across employment variables (employment type, education, industry sector, occupation, marital status). Used for factorial discriminant analysis.
data(employment.int)
A data frame with 12 observations and 20 columns (9 interval variables
in _l/_u Min-Max pairs, plus a group label and class):
group: Sex-age group identifier (character).
full_time_l, full_time_u: Full-time employment proportion range.
part_time_l, part_time_u: Part-time employment proportion range.
primary_studies_l, primary_studies_u: Primary studies proportion range.
secondary_studies_l, secondary_studies_u: Secondary studies proportion range.
uni_studies_l, uni_studies_u: University studies proportion range.
employee_l, employee_u: Employee proportion range.
manufacturing_l, manufacturing_u: Manufacturing sector proportion range.
construction_l, construction_u: Construction sector proportion range.
wholesale_retail_l, wholesale_retail_u: Wholesale/retail proportion range.
class: Group classification (numeric).
| Sample size (n) | 12 |
| Variables (p) | 20 |
| Subject area | Economics |
| Symbolic format | Interval |
| Analytical tasks | Discriminant analysis, Classification |
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 18.1.
data(employment.int)
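Since the interval variables are proportions stored as _l/_u column pairs, it can be useful to confirm that every lower bound is at most its upper bound and that all bounds lie in [0, 1]. The sketch below pairs the columns by name on a made-up two-row excerpt; the values are illustrative and only two of the nine variables are included.

```r
# Hypothetical excerpt in the documented _l/_u layout (values are made up)
emp <- data.frame(
  group = c("g1", "g2"),
  full_time_l = c(0.60, 0.40), full_time_u = c(0.80, 0.55),
  part_time_l = c(0.10, 0.20), part_time_u = c(0.25, 0.35),
  class = c(1, 2)
)

# Pair each lower-bound column with its upper-bound counterpart by name
lower_cols <- grep("_l$", names(emp), value = TRUE)
upper_cols <- sub("_l$", "_u", lower_cols)
stopifnot(all(upper_cols %in% names(emp)))

L <- as.matrix(emp[lower_cols])
U <- as.matrix(emp[upper_cols])

# Proportion intervals must be ordered and stay within [0, 1]
valid <- all(L >= 0, U <= 1, L <= U)

# Interval widths, labelled by variable name
widths <- U - L
colnames(widths) <- sub("_l$", "", lower_cols)
```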
Distribution-valued dataset of energy consumption across US states. Each energy type is described by the parameters of a Normal distribution (mean, SD).
data(energy_consumption.distr)
A data frame with 5 observations and 3 variables:
type: Energy type.
mean: Mean consumption across 50 states.
sd: Standard deviation.
Five types: Petroleum, Natural Gas, Coal, Hydroelectric, Nuclear Power. Values are rescaled consumption from the US Census Bureau (2004).
| Sample size (n) | 5 |
| Variables (p) | 3 |
| Subject area | Energy |
| Symbolic format | Distribution |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.8.
data(energy_consumption.distr)
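A Normal-distribution-valued observation is fully determined by its mean and standard deviation, so quantities such as a central 95% range follow directly from qnorm(). The sketch below uses a made-up data frame with the documented type/mean/sd columns; the numbers are illustrative and are not the rescaled Census values, and the q025/q975 column names are assumptions of this example.

```r
# Hypothetical rows in the documented layout (values are made up)
energy <- data.frame(
  type = c("Petroleum", "Coal"),
  mean = c(50, 20),
  sd   = c(10, 5)
)

# Central 95% range implied by each Normal(mean, sd) description
energy$q025 <- qnorm(0.025, mean = energy$mean, sd = energy$sd)
energy$q975 <- qnorm(0.975, mean = energy$mean, sd = energy$sd)
```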
Distribution-valued dataset for 10 towns (geographic areas) with categorical probability distributions for fuel type and central heating. Each observation has two distribution-valued variables.
data(energy_usage.distr)
A data frame with 10 observations and 2 distribution-valued variables:
fuel_type: Distribution over fuel types
(None, Gas, Oil, Electricity, Coal).
central_heating: Distribution over central heating
(No, Yes).
Row names are Town_1 through Town_10.
| Sample size (n) | 10 |
| Variables (p) | 2 |
| Subject area | Energy |
| Symbolic format | Distribution |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 3.7.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 3.7.
data(energy_usage.distr)
Mixed symbolic dataset from the US EPA with 14 state-group observations and 17 variables of mixed types: interval-valued environmental measurements and modal-valued (distributional) categorical variables.
data(environment.mix)
A symbolic data frame (symbolic_tbl) with 14 observations and
17 variables:
URBANICITY: Modal-valued urbanicity distribution (character).
INCOMELEVEL: Modal-valued income level distribution (character).
EDUCATION: Modal-valued education distribution (character).
REGIONDEVELOPME: Modal-valued regional development distribution (character).
CONTROL: Environmental control index range (interval).
SATISFY: Satisfaction index range (interval).
INDIVIDUAL: Individual concern index range (interval).
WELFARE: Welfare index range (interval).
HUMAN: Human impact index range (interval).
POLITICS: Political concern index range (interval).
BURDEN: Burden index range (interval).
NOISE: Noise pollution index range (interval).
NATURE: Nature preservation index range (interval).
SEASETC: Seas/coastal index range (interval).
MULTI: Multi-indicator range (interval).
WATERWASTE: Water/waste index range (interval).
VEHICLE: Vehicle emissions index range (interval).
| Sample size (n) | 14 |
| Variables (p) | 17 |
| Subject area | Environment |
| Symbolic format | Mixed (interval, modal) |
| Analytical tasks | Descriptive statistics, Clustering |
Extracted from ggESDA package (Environment).
Sun, Y. and Billard, L. (2020). Symbolic data analysis with the ggESDA package. Journal of Statistical Software.
data(environment.mix)
Daily high and low values of the EUR/USD exchange rate from January 1, 2004 to December 30, 2005 (520 trading days). Inspired by the dataset used by Arroyo, Espinola and Mate (2011) for exponential smoothing methods for interval time series.
data(euro_usd.its)
A data frame with 520 observations and 3 variables:
date: Trading date (Date class).
low: Daily low EUR/USD exchange rate.
high: Daily high EUR/USD exchange rate.
The EUR/USD exchange rate is the most traded currency pair in the world foreign exchange market. Each observation represents a trading day with the daily low and high exchange rates (USD per EUR) forming an interval. Note: the original study by Arroyo et al. (2011) used the period 2002–2003 (519 trading days); this dataset covers 2004–2005 because Yahoo Finance historical data for this ticker is only available from late 2003 onward.
| Sample size (n) | 520 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance / Foreign Exchange |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker EURUSD=X. Downloaded via the
quantmod package.
Arroyo, J., Espinola, R. and Mate, C. (2011). Different approaches to forecast interval time series: A comparison in finance. Computational Economics, 37(2), 169–191.
data(euro_usd.its)
head(euro_usd.its)
plot(euro_usd.its$date, euro_usd.its$high, type = "l", col = "red", ylab = "EUR/USD", xlab = "Date", main = "EUR/USD Daily High/Low (2004-2005)")
lines(euro_usd.its$date, euro_usd.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Histogram-valued time series of 108 monthly observations of daily exchange rate returns. Each observation is a histogram distribution of intra-month daily returns.
data(exchange_rate_returns.hist)
A data frame with 108 observations and 1 histogram-valued variable:
returns: Histogram of daily exchange rate returns
within each month.
| Sample size (n) | 108 |
| Variables (p) | 1 |
| Subject area | Finance |
| Symbolic format | Histogram |
| Analytical tasks | Time series, Descriptive statistics |
HistDAWass R package (RetHTS dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (RetHTS dataset).
data(exchange_rate_returns.hist)
Interval-valued facial measurement data for 27 face images (9 individuals x 3 replications) in iGAP format (comma-separated interval strings). Contains 6 distance measurements between facial landmarks.
data(face.iGAP)
A data frame with 27 observations and 6 character columns in iGAP
format (comma-separated "min,max" strings):
AD: Distance AD (facial landmark pair).
BC: Distance BC (facial landmark pair).
AH: Distance AH (facial landmark pair).
DH: Distance DH (facial landmark pair).
EH: Distance EH (facial landmark pair).
GH: Distance GH (facial landmark pair).
Row names encode individual and replication (e.g., FRA1, FRA2, FRA3).
| Sample size (n) | 27 |
| Variables (p) | 6 |
| Subject area | Biometrics |
| Symbolic format | Interval (iGAP) |
| Analytical tasks | Classification, Visualization |
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
data(face.iGAP)
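To make the iGAP layout concrete, the following base-R sketch splits comma-separated "min,max" strings into numeric _min/_max columns. The package ships its own conversion functions, which should be preferred in practice; the helper split_interval and the two-row excerpt below are hypothetical, and the measurements are made up.

```r
# Hypothetical excerpt in iGAP format (values are made up)
face <- data.frame(
  AD = c("155.00,157.00", "154.00,160.01"),
  BC = c("58.00,61.01", "57.00,64.00"),
  row.names = c("FRA1", "FRA2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "min,max" strings into a two-column numeric matrix
split_interval <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  stopifnot(all(lengths(parts) == 2))
  m <- t(vapply(parts, function(p) as.numeric(trimws(p)), numeric(2)))
  colnames(m) <- c("min", "max")
  m
}

# Apply to every column and bind into a Min-Max data frame
mm <- do.call(cbind, lapply(names(face), function(v) {
  m <- split_interval(face[[v]])
  colnames(m) <- paste(v, colnames(m), sep = "_")
  m
}))
mm <- data.frame(mm, row.names = rownames(face))
```

The result has two numeric columns per interval variable, which is the layout used by the Min-Max datasets in this package.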
Interval-valued data for 14 business sectors described by job-related financial variables (job cost codes, activity codes, budgets). Used for PCA demonstrations.
data(finance.int)
A symbolic data frame (symbolic_tbl) with 14 observations and 7 variables:
Sector: Business sector name (character).
Job_Cost: Job cost range (currency units, interval).
Job_Code: Job code range (interval).
Activity_Code: Activity code range (interval).
Monthly_Cost: Monthly cost range (currency units, interval).
Annual_Budget: Annual budget range (currency units, interval).
n: Number of entities in the sector (numeric).
| Sample size (n) | 14 |
| Variables (p) | 7 |
| Subject area | Finance |
| Symbolic format | Interval |
| Analytical tasks | PCA |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 5.2.
data(finance.int)
Histogram-valued dataset of 16 airlines with 5 flight performance histograms. Each histogram has 12 bins describing the distribution of a performance metric across flights for that airline.
data(flights_detail.hist)
A data frame with 16 observations (airlines) and 5 histogram-valued variables:
airtime: Histogram of air time (minutes).
taxi_in: Histogram of taxi-in time (minutes).
arrival_delay: Histogram of arrival delay (minutes).
taxi_out: Histogram of taxi-out time (minutes).
departure_delay: Histogram of departure delay (minutes).
Row names are Airline_1 through Airline_16.
| Sample size (n) | 16 |
| Variables (p) | 5 |
| Subject area | Transportation |
| Symbolic format | Histogram |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 5-1.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 5-1.
data(flights_detail.hist)
Histogram-valued dataset of 22 French regions with 4 economic histogram variables related to agricultural production. Each histogram describes the distribution of farm-level values within a region.
data(french_agriculture.hist)
A data frame with 22 observations (French regions) and 4 histogram-valued variables:
Y_TSC: Histogram of total standard coefficient.
X_Wheat: Histogram of wheat production.
X_Pig: Histogram of pig production.
X_Cmilk: Histogram of cow milk production.
Row names are French region names (e.g., Ile-de-France, Picardie).
| Sample size (n) | 22 |
| Variables (p) | 4 |
| Subject area | Agriculture |
| Symbolic format | Histogram |
| Analytical tasks | Regression, Clustering |
HistDAWass R package (Agronomique dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (Agronomique dataset).
data(french_agriculture.hist)
Interval-valued dataset of heavy metal concentrations in organs and tissues of 12 freshwater fish species, grouped into 4 feeding categories (Carnivores, Omnivores, Detritivores, Herbivores). Contains 13 interval-valued variables: body length, body weight, metal concentrations in six organs and tissues, and five organ-to-muscle concentration ratios.
data(freshwater_fish.int)
A data frame with 12 observations and 14 variables:
body_length: Body length (cm).
body_weight: Body weight (g).
muscle: Metal concentration in muscle tissue.
intestine: Metal concentration in intestine.
stomach: Metal concentration in stomach.
gills: Metal concentration in gills.
liver: Metal concentration in liver.
kidney: Metal concentration in kidney.
liver_muscle_ratio: Liver-to-muscle concentration ratio.
kidney_muscle_ratio: Kidney-to-muscle concentration ratio.
gills_muscle_ratio: Gills-to-muscle concentration ratio.
intestine_muscle_ratio: Intestine-to-muscle concentration ratio.
stomach_muscle_ratio: Stomach-to-muscle concentration ratio.
class: Feeding category (Carnivores, Omnivores, Detritivores, Herbivores).
| Sample size (n) | 12 |
| Variables (p) | 14 |
| Subject area | Biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(freshwater_fish.int)
Modal-valued dataset describing fuel consumption patterns across 10 regions by proportions of heating fuel types (gas, oil, electricity, other) and per-capita expenditure.
data(fuel_consumption.modal)
A symbolic data frame (symbolic_tbl) with 10 observations and 3 variables:
Region: Region identifier (character).
Expenditure: Per-capita fuel expenditure (numeric).
Fuel_Type: Modal distribution over fuel types
(gas, oil, electric, other).
| Sample size (n) | 10 |
| Variables (p) | 3 |
| Subject area | Energy |
| Symbolic format | Modal |
| Analytical tasks | Regression |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 3.7.
data(fuel_consumption.modal)
Interval-valued morphological measurements for 55 fungi specimens from 3 genera (Amanita, Agaricus, Boletus). Contains 5 interval-valued variables describing pileus and stipe dimensions and spore characteristics.
data(fungi.int)
A data frame with 55 observations and 6 variables:
pileus_width: Width of the pileus (cap).
stipe_width: Width of the stipe (stem).
stipe_thickness: Thickness of the stipe.
spore_height: Height of the spores.
spore_width: Width of the spores.
class: Fungus genus (Amanita, Agaricus, Boletus).
| Sample size (n) | 55 |
| Variables (p) | 6 |
| Subject area | Biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(fungi.int)
Interval-valued dataset of dinucleotide relative abundances for 14 genome classes. Each class aggregates multiple genomes; the intervals represent the range of observed abundance values within each class for 10 dinucleotide pairs, plus a count variable.
data(genome_abundances.int)
A symbolic data frame (symbolic_tbl) with 14 observations
(genome classes) and 11 variables:
CG: Interval-valued CG dinucleotide relative abundance.
GC: Interval-valued GC dinucleotide relative abundance.
TA: Interval-valued TA dinucleotide relative abundance.
AT: Interval-valued AT dinucleotide relative abundance.
CC: Interval-valued CC dinucleotide relative abundance.
AA: Interval-valued AA dinucleotide relative abundance.
AC: Interval-valued AC dinucleotide relative abundance.
AG: Interval-valued AG dinucleotide relative abundance.
CA: Interval-valued CA dinucleotide relative abundance.
GA: Interval-valued GA dinucleotide relative abundance.
n: Number of genomes in the class (integer).
Row names are Class_1 through Class_14.
| Sample size (n) | 14 |
| Variables (p) | 11 |
| Subject area | Genomics |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2020), Table 3-16.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 3-16.
data(genome_abundances.int)
Histogram-valued dataset of 4 regions with a single histogram-valued variable describing the distribution of blood glucose measurements.
data(glucose.hist)
A data frame with 4 observations (regions) and 1 histogram-valued variable:
glucose: Histogram of blood glucose levels.
Row names are Region_1 through Region_4.
| Sample size (n) | 4 |
| Variables (p) | 1 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2020), Table 4-14.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 4-14.
data(glucose.hist)
Histogram-valued climate data for 5 hardwood tree species in the southeastern United States. Each observation represents a species with 4 histogram-valued climate variables.
data(hardwood.hist)
A data frame with 5 observations and 4 histogram-valued variables:
ANNT: Annual temperature histogram (degrees C).
JULT: July temperature histogram (degrees C).
ANNP: Annual precipitation histogram (mm).
MITM: Moisture index histogram.
| Sample size (n) | 5 |
| Variables (p) | 4 |
| Subject area | Forestry |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Descriptive statistics |
Extracted from RSDA package (hardwoodBrito).
Brito, P. (2007). Modelling and Analysing Interval Data. In V. Esposito Vinzi et al. (Eds.), New Developments in Classification and Data Analysis, pp. 197-208. Springer.
data(hardwood.hist)
Interval-valued World Bank gender indicators for 183 countries, with ordinal HDI classification. Contains interval ranges for Women, Business and the Law Index Score and proportion of seats held by women in national parliaments.
data(hdi_gender.int)
A data frame with 183 observations and 6 variables:
code: ISO 3166-1 alpha-3 country code.
country: Country name.
hdi: Human Development Index value (UNDP).
women_law_index: Women, Business and the Law Index Score range.
women_parliament: Proportion of seats held by women in
national parliaments range (%).
hdi_category: Ordered factor with HDI classification
(Low < Medium < High < Very High).
| Sample size (n) | 183 |
| Variables (p) | 6 |
| Subject area | Socioeconomics |
| Symbolic format | Interval |
| Analytical tasks | Classification |
https://github.com/aleixalcacer/OCFIVD
Alcacer, A., Barrel, A., Groenen, P. J. F. and Grana, M. (2023). Ordinal classification for interval-valued data and ordinal data. Expert Systems with Applications, 238, 121825.
data(hdi_gender.int)
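The hdi_category column is an ordered factor, which is what makes ordinal comparisons between countries meaningful. The sketch below builds a small factor with the documented level order (Low < Medium < High < Very High) and shows two operations that depend on that order; the four category values are made up and do not correspond to particular countries.

```r
# Hypothetical category values using the documented level order
hdi_category <- factor(
  c("High", "Low", "Very High", "Medium"),
  levels = c("Low", "Medium", "High", "Very High"),
  ordered = TRUE
)

# Ordinal comparison: which entries rank strictly above "Medium"?
above_medium <- hdi_category > "Medium"

# Integer ranks that follow the level order (Low = 1, ..., Very High = 4)
ranks <- as.integer(hdi_category)
```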
Classical (microdata) health insurance dataset of 51 individual patient
records with 30 variables including demographics, clinical measurements,
and diagnostic indicators. This is the raw data underlying the
symbolic health_insurance2.modal dataset.
data(health_insurance.mix)
A data frame with 51 observations and 30 variables (Y1–Y30):
Y1: City (character).
Y2: Gender (M/F, character).
Y3: Age (integer).
Y4: Sex (M/D, character).
Y5: Marital status (S/M, character).
Y6: Number of dependents (integer).
Y7: Parents alive indicator (integer).
Y8: Number of children (integer).
Y9: Height (cm, integer).
Y10: Weight (pounds, integer).
Y11: Systolic blood pressure (mmHg, integer).
Y12: Diastolic blood pressure (mmHg, integer).
Y13: Cholesterol (mg/dL, integer).
Y14: Cholesterol measure 2 (integer).
Y15: Additional lab measurement (integer).
Y16: Ratio measurement (numeric).
Y17: Lab value (integer).
Y18: Lab value (integer).
Y19: Lab value (integer).
Y20: Lab ratio (numeric).
Y21: Additional lab value (integer).
Y22: Additional lab value (integer).
Y23: Blood chemistry value (numeric).
Y24: Blood chemistry value (numeric).
Y25: Blood chemistry value (numeric).
Y26: Blood chemistry value (numeric).
Y27: Blood chemistry value (numeric).
Y28: Diagnostic indicator (Y/N, character).
Y29: Diagnostic indicator (Y/N, character).
Y30: Count variable (integer).
| Sample size (n) | 51 |
| Variables (p) | 30 |
| Subject area | Medical |
| Symbolic format | Classical (microdata) |
| Analytical tasks | Descriptive statistics, Aggregation |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Tables 2.1-2.2.
data(health_insurance.mix)
Modal-valued symbolic version of the health insurance dataset, aggregated
into 6 disease-type-by-gender groups. See health_insurance.mix
for the underlying microdata.
data(health_insurance2.modal)
A symbolic data frame (symbolic_tbl) with 6 observations and
6 variables:
Type Gender: Disease type and gender label (character).
Age: Modal distribution over age bins.
Marital Status: Modal distribution over marital status (M, S).
Parents Alive: Modal distribution over number of parents alive (0, 1, 2).
Weight: Modal distribution over weight bins (pounds).
Cholesterol: Modal distribution over cholesterol bins (mg/dL).
| Sample size (n) | 6 |
| Variables (p) | 6 |
| Subject area | Medical |
| Symbolic format | Modal |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.2b.
data(health_insurance2.modal)
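The step from microdata to a modal-valued variable amounts to computing, within each group, the relative frequency of every category. The base-R sketch below shows this for marital status on a made-up five-record excerpt; the group labels and records are hypothetical and the package may construct the symbolic table differently.

```r
# Hypothetical microdata: one row per individual (values are made up)
micro <- data.frame(
  group   = c("Dental F", "Dental F", "Dental F", "Dental M", "Dental M"),
  marital = c("M", "S", "M", "S", "S")
)

# Cross-tabulate group by category, then normalise each row to sum to 1
counts <- table(micro$group, micro$marital)
modal  <- prop.table(counts, margin = 1)
```

Each row of modal is the modal distribution over marital status for one group.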
Bivariate histogram-valued dataset with 10 observations, each described by a 2-bin hematocrit histogram and a 2-bin hemoglobin histogram. Used for bivariate symbolic regression demonstrations.
data(hematocrit_hemoglobin.hist)
A data frame with 10 observations and 2 histogram-valued variables:
hematocrit: Histogram-valued hematocrit distribution (%).
hemoglobin: Histogram-valued hemoglobin distribution (g/dL).
| Sample size (n) | 10 |
| Variables (p) | 2 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Regression |
Billard, L. and Diday, E. (2006), Table 6.8.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 6.8.
data(hematocrit_hemoglobin.hist)
Histogram-valued hematocrit distributions for 14 gender-age groups (7 female + 7 male age groups from 20s to 80+). Each observation has a 10-bin histogram of hematocrit percentages.
data(hematocrit.hist)
A data frame with 14 observations and 3 variables:
gender: Gender (Female or Male).
age: Age group (20s, 30s, 40s, 50s, 60s, 70s, 80+).
hematocrit: Histogram-valued hematocrit distribution (%).
| Sample size (n) | 14 |
| Variables (p) | 3 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 4.14.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 4.14.
data(hematocrit.hist)
Histogram-valued hemoglobin distributions for 14 gender-age groups (7 female + 7 male age groups from 20s to 80+). Each observation has a 10-bin histogram of hemoglobin levels (g/dL).
data(hemoglobin.hist)
A data frame with 14 observations and 3 variables:
gender: Gender (Female or Male).
age: Age group (20s, 30s, 40s, 50s, 60s, 70s, 80+).
hemoglobin: Histogram-valued hemoglobin distribution (g/dL).
| Sample size (n) | 14 |
| Variables (p) | 3 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 4.6.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 4.6.
data(hemoglobin.hist)
Classical (microdata) dataset of 20 observations illustrating hierarchical
categorical structures with a response variable Y and hierarchical
predictors X1–X5. See hierarchy.int for the interval-valued
version.
data(hierarchy)
A data frame with 20 observations and 6 variables:
Y: Response variable (numeric).
X1: Hierarchy level 1 category (a/b/c, character).
X2: Hierarchy level 2 category (a1/a2, character; NA for non-a).
X3: Hierarchy level 3 category (a11/a12, character; NA for non-a1).
X4: Numeric predictor for group b (numeric; NA for non-b).
X5: Numeric predictor for group c (numeric; NA for non-c).
| Sample size (n) | 20 |
| Variables (p) | 6 |
| Subject area | Methodology |
| Symbolic format | Classical (microdata) |
| Analytical tasks | Aggregation, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.15.
data(hierarchy)
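The aggregation task listed for this dataset usually means collapsing classical microdata into one interval per group. The sketch below shows that idea in base R on made-up values; the `micro` data frame and its columns are hypothetical, and this is not a description of how hierarchy.int was built.

```r
# Made-up microdata: a grouping variable and a numeric response.
micro <- data.frame(
  group = c("a", "a", "b", "b", "b", "c"),
  Y     = c(10, 14, 7, 9, 12, 20)
)

# Group-wise minimum and maximum give the interval bounds.
y_min <- tapply(micro$Y, micro$group, min)
y_max <- tapply(micro$Y, micro$group, max)

# One row per group, stored as a pair of _min/_max columns.
intervals <- data.frame(
  Y_min = as.numeric(y_min),
  Y_max = as.numeric(y_max),
  row.names = names(y_min)
)
intervals
```

A group with a single observation, such as `c` here, collapses to a degenerate interval whose two bounds coincide.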
Mixed symbolic dataset of 10 observations with hierarchical categorical variables, conditional histogram variables, and an interval-valued variable. From Table 6.20 of Billard and Diday (2006).
data(hierarchy.hist)
A symbolic data frame (symbolic_tbl) with 10 observations
and 7 variables:
duration_time: Histogram-valued duration (2-bin).
hierarchy_1: Categorical hierarchy level 1 (a/b/c).
hierarchy_2: Categorical hierarchy level 2 (a1/a2), conditional on hierarchy_1 = a.
hierarchy_3: Categorical hierarchy level 3 (a11/a12), conditional on hierarchy_2 = a1.
glucose: Histogram-valued glucose (2-bin), conditional.
pulse_rate: Histogram-valued pulse rate (2-bin), conditional.
cholesterol: Interval-valued cholesterol level.
| Sample size (n) | 10 |
| Variables (p) | 7 |
| Subject area | Methodology |
| Symbolic format | Mixed (histogram, interval, categorical) |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 6.20.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 6.20.
data(hierarchy.hist)
Interval-valued version of the hierarchy dataset. See hierarchy
for the classical version.
data(hierarchy.int)
A symbolic data frame (symbolic_tbl) with 20 observations and 6 variables:
Y: Response variable range (interval).
X1: Hierarchy level 1 category (a/b/c, character).
X2: Hierarchy level 2 category (a1/a2, character; NA for non-a).
X3: Hierarchy level 3 category (a11/a12, character; NA for non-a1).
X4: Predictor range for group b (interval; NA for non-b).
X5: Predictor range for group c (interval; NA for non-c).
| Sample size (n) | 20 |
| Variables (p) | 6 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Regression |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.15.
data(hierarchy.int)
Functions to compute the mean, variance, covariance, and correlation of histogram-valued data.
hist_mean(x, var_name, method = "BG", ...)
hist_var(x, var_name, method = "BG", ...)
hist_cov(x, var_name1, var_name2, method = "BG", ...)
hist_cor(x, var_name1, var_name2, method = "BG", ...)
| x | histogram-valued data object. |
| var_name | the variable name or the column location. |
| method | method to calculate statistics. One of "BG" (default), "BD", "B", or "L2W". |
| ... | additional parameters. |
| var_name1 | the first variable name or column location. |
| var_name2 | the second variable name or column location. |
Four functions are provided:
hist_mean: Compute the mean of histogram-valued data.
hist_var: Compute the variance of histogram-valued data.
hist_cov: Compute the covariance between two histogram-valued variables.
hist_cor: Compute the correlation between two histogram-valued variables.
Four methods are supported for all functions:
BG: Bertrand and Goupil (2000) method. Uses histogram bin boundaries and probabilities to compute first and second moments.
BD: Billard and Diday (2006) method. A signed decomposition using the sign of each bin's midpoint deviation from the overall mean and a quadratic form on the bin boundaries.
B: Billard (2008) method. Uses cross-products of deviations of the bin boundaries from the overall mean.
L2W: L2 Wasserstein method. Uses optimal-transport (Wasserstein) distances between the quantile functions of the histogram distributions.
For the mean, BG, BD, and B return the same value because they share the same first-order moment definition; only L2W uses a different (quantile-based) mean. For variance, covariance, and correlation, all four methods generally produce different results.
For hist_cor, the BG, BD, and B correlations all use the
Bertrand-Goupil standard deviation in the denominator, following
Irpino and Verde (2015, Eqs. 30–32). Only the L2W method uses its own
Wasserstein-based standard deviation in the denominator.
A numeric value or vector for hist_mean and hist_var; a single numeric value for hist_cov and hist_cor.
Po-Wei Chen, Han-Ming Wu
int_mean, int_var, int_cov, int_cor
library(HistDAWass)
x <- HistDAWass::BLOOD
hist_mean(x, var_name = "Cholesterol", method = "BG")
hist_mean(x, var_name = "Cholesterol", method = "BD")
hist_var(x, var_name = "Cholesterol", method = "BG")
hist_var(x, var_name = "Cholesterol", method = "BD")
hist_cov(x, var_name1 = "Cholesterol", var_name2 = "Hemoglobin", method = "BG")
hist_cor(x, var_name1 = "Cholesterol", var_name2 = "Hemoglobin", method = "BG")
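To make the moment-based idea behind the Bertrand and Goupil description concrete, here is a minimal base-R sketch that computes a mean and variance from bin boundaries and probabilities. It assumes values are uniformly spread within each bin and uses a 1/n divisor; the list layout (`breaks`, `probs`) and the helper names are hypothetical, and the package's own implementation may differ in detail.

```r
# Toy histogram-valued variable: one list element per observation.
# Each histogram has bin boundaries (breaks) and bin probabilities (probs).
hists <- list(
  list(breaks = c(0, 2, 4), probs = c(0.5, 0.5)),
  list(breaks = c(2, 4, 8), probs = c(0.25, 0.75))
)

# First moment of one histogram: probability-weighted bin midpoints.
hist_moment1 <- function(h) {
  lower <- head(h$breaks, -1)
  upper <- tail(h$breaks, -1)
  sum(h$probs * (lower + upper) / 2)
}

# Second moment: E[X^2] of a uniform on [a, b] is (a^2 + a*b + b^2) / 3.
hist_moment2 <- function(h) {
  lower <- head(h$breaks, -1)
  upper <- tail(h$breaks, -1)
  sum(h$probs * (lower^2 + lower * upper + upper^2) / 3)
}

# Average the per-observation moments, then apply var = E[X^2] - mean^2.
bg_mean <- mean(vapply(hists, hist_moment1, numeric(1)))
bg_var  <- mean(vapply(hists, hist_moment2, numeric(1))) - bg_mean^2
c(mean = bg_mean, var = bg_var)
```

Because the second moment includes the within-bin spread, the resulting variance is larger than the variance of the per-observation means alone.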
Interval-valued data for 8 horse breeds (CES, CMA, PEN, TES, CEN, LES, PES, PAM) described by 6 interval-valued variables: minimum weight, maximum weight, minimum height, maximum height, cost of mares, and cost of fillies.
data(horses.int)
A symbolic data frame (symbolic_tbl) with 8 observations and 7 variables:
Breed: Horse breed code (CES, CMA, PEN, TES, CEN, LES, PES, PAM; character).
Minimum_Weight: Minimum weight range (kg, interval).
Maximum_Weight: Maximum weight range (kg, interval).
Minimum_Height: Minimum height range (cm, interval).
Maximum_Height: Maximum height range (cm, interval).
Mares_Cost: Cost of mares range (currency units, interval).
Fillies_Cost: Cost of fillies range (currency units, interval).
Extensively used in SDA for demonstrating divisive clustering, distance computation, hierarchy/pyramid construction, and complete objects.
| Sample size (n) | 8 |
| Variables (p) | 7 |
| Subject area | Zoology |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 7.14.
data(horses.int)
Histogram-valued cost distributions for 15 hospitals. Each observation is a hospital with a 10-bin histogram of patient costs.
data(hospital.hist)
A data frame with 15 observations and 1 histogram-valued variable:
cost: Histogram-valued cost distribution (currency units).
Row names are H1 through H15.
| Sample size (n) | 15 |
| Variables (p) | 1 |
| Subject area | Healthcare |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006), Table 3.12.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 3.12.
data(hospital.hist)
Distribution-valued dataset of 12 counties with 3 categorical probability distribution variables describing household fuel type, number of rooms, and household income brackets.
data(household_characteristics.distr)
A data frame with 12 observations (counties) and 3 distribution-valued variables:
fuel_type: Distribution over fuel types
(gas, electric, oil, wood, none).
rooms: Distribution over room counts
({1,2}, {3,4,5}, {>=6}).
household_income: Distribution over income brackets
(<10, [10,25), [25,50), [50,75), [75,100), [100,150), [150,200), >=200).
Row names are County_1 through County_12.
| Sample size (n) | 12 |
| Variables (p) | 3 |
| Subject area | Socioeconomics |
| Symbolic format | Distribution |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2020), Table 6-1.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 6-1.
data(household_characteristics.distr)
Daily high and low values of the Brazilian IBOVESPA stock market index from January 3, 2000 to December 28, 2012 (3216 trading days). This dataset matches the period used by Maciel, Ballini and Gomide (2016) for evolving granular analytics for interval time series forecasting.
data(ibovespa.its)
A data frame with 3216 observations and 3 variables:
date: Trading date (Date class).
low: Daily low value of the IBOVESPA index.
high: Daily high value of the IBOVESPA index.
The IBOVESPA (Indice Bovespa) is the benchmark index of the Brazilian stock exchange (B3, formerly BM&FBOVESPA). It tracks the performance of the most actively traded stocks on the Sao Paulo stock exchange. The 13-year span of this dataset covers multiple market regimes including the 2008 global financial crisis, making it suitable for evaluating forecasting models under diverse conditions.
| Sample size (n) | 3216 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker ^BVSP. Downloaded via the
quantmod package.
Maciel, L., Ballini, R. and Gomide, F. (2016). Evolving granular analytics for interval time series forecasting. Granular Computing, 1(4), 213–224.
data(ibovespa.its)
head(ibovespa.its)
plot(ibovespa.its$date, ibovespa.its$high, type = "l", col = "red",
     ylab = "Index Value", xlab = "Date",
     main = "IBOVESPA Daily High/Low (2000-2012)")
lines(ibovespa.its$date, ibovespa.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Convert iGAP format to a 3-dimensional array [n, p, 2].
iGAP_to_ARRAY(data, location = NULL)
| data | A data.frame in iGAP format. |
| location | Integer vector specifying which columns contain comma-separated interval values. |
A numeric array of dimension [n, p, 2] with dimnames.
data(abalone.iGAP)
arr <- iGAP_to_ARRAY(abalone.iGAP, 1:7)
dim(arr)
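The conversion described above amounts to splitting each "min, max" string and stacking the two bounds along a third dimension. The following is a rough base-R sketch of that idea on made-up values; `igap_to_array_sketch` is a hypothetical helper, not the package function, and it assumes every selected cell holds exactly two comma-separated numbers.

```r
# Toy iGAP-style data frame (values are invented for illustration).
igap <- data.frame(
  Length = c("0.28, 0.66", "0.30, 0.75"),
  Height = c("0.09, 0.18", "0.10, 0.24"),
  row.names = c("F-4-6", "M-4-6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse the selected columns into an [n, p, 2] array.
igap_to_array_sketch <- function(data, location = seq_along(data)) {
  cols <- names(data)[location]
  arr <- array(NA_real_,
               dim = c(nrow(data), length(cols), 2),
               dimnames = list(rownames(data), cols, c("min", "max")))
  for (j in seq_along(cols)) {
    parts <- strsplit(data[[cols[j]]], ",", fixed = TRUE)
    # vapply yields a 2 x n matrix (min row, max row); transpose to n x 2.
    bounds <- vapply(parts, function(p) as.numeric(trimws(p)), numeric(2))
    arr[, j, ] <- t(bounds)
  }
  arr
}

arr <- igap_to_array_sketch(igap)
dim(arr)
arr["F-4-6", "Length", ]
```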
Convert an iGAP-format data frame to MM (min-max) format.
iGAP_to_MM(data, location = NULL)
| data | A data frame in iGAP format. |
| location | Column positions of the symbolic (interval-valued) variables in data. |
A data frame in MM format.
data(abalone.iGAP)
abalone <- iGAP_to_MM(abalone.iGAP, 1:7)
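For readers who want to see the target layout, this base-R sketch turns comma-separated interval strings into paired `_min`/`_max` columns. The data values and the helper name `igap_to_mm_sketch` are hypothetical; the real function may order or name its columns differently.

```r
# Toy iGAP-style data frame (values are invented for illustration).
igap <- data.frame(
  Length = c("0.28, 0.66", "0.30, 0.75"),
  Height = c("0.09, 0.18", "0.10, 0.24"),
  row.names = c("F-4-6", "M-4-6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one <name>_min and one <name>_max column per variable.
igap_to_mm_sketch <- function(data, location = seq_along(data)) {
  out <- list()
  for (nm in names(data)[location]) {
    parts <- strsplit(data[[nm]], ",", fixed = TRUE)
    out[[paste0(nm, "_min")]] <-
      vapply(parts, function(p) as.numeric(trimws(p[1])), numeric(1))
    out[[paste0(nm, "_max")]] <-
      vapply(parts, function(p) as.numeric(trimws(p[2])), numeric(1))
  }
  data.frame(out, row.names = rownames(data), check.names = FALSE)
}

mm <- igap_to_mm_sketch(igap)
names(mm)
```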
Convert an iGAP-format interval data frame to RSDA format (symbolic_tbl).
iGAP_to_RSDA(data, location = NULL)
| data | A data frame in iGAP format. |
| location | Column positions of the symbolic (interval-valued) variables in data. |
A symbolic_tbl data frame with complex-encoded interval columns.
data(abalone.iGAP)
rsda <- iGAP_to_RSDA(abalone.iGAP, 1:7)
Automatically detect the format of interval data and convert it to the target format.
int_convert_format(x, to = "MM", from = NULL, ...)
| x | interval data in one of the supported formats |
| to | target format: "MM", "iGAP", "RSDA", "ARRAY", "SODAS" (default: "MM") |
| from | source format (optional): "MM", "iGAP", "RSDA", "ARRAY", "SODAS". If NULL, will auto-detect. |
| ... | additional parameters passed to specific conversion functions |
This function provides a unified interface for all interval format conversions. It automatically detects the source format (unless specified) and applies the appropriate conversion function.
Supported conversions:
RSDA → MM, iGAP, ARRAY
MM → iGAP, RSDA, ARRAY
iGAP → MM, RSDA, ARRAY
ARRAY → RSDA, MM, iGAP
SODAS → MM, iGAP, ARRAY
Interval data in the target format
Han-Ming Wu
int_detect_format, int_list_conversions, RSDA_to_MM, RSDA_to_ARRAY, MM_to_RSDA, MM_to_ARRAY, ARRAY_to_RSDA, ARRAY_to_MM, ARRAY_to_iGAP, iGAP_to_MM, iGAP_to_RSDA, iGAP_to_ARRAY, MM_to_iGAP
# Auto-detect and convert to MM
data(mushroom.int)
data_mm <- int_convert_format(mushroom.int, to = "MM")
# Explicitly specify source format
data(abalone.iGAP)
data_mm <- int_convert_format(abalone.iGAP, from = "iGAP", to = "MM")
# Convert MM to iGAP
data_igap <- int_convert_format(data_mm, to = "iGAP")
# Convert multiple datasets to MM (load each dataset first)
data(abalone.int)
data(car.int)
datasets <- list(mushroom.int, abalone.int, car.int)
mm_datasets <- lapply(datasets, int_convert_format, to = "MM")
# Check what conversions are available
int_list_conversions()
Automatically detect the format of interval data.
int_detect_format(x)
| x | interval data in unknown format |
Detection rules:
RSDA: has class "symbolic_tbl" and contains complex columns
MM: data.frame with paired "_min" and "_max" columns
iGAP: data.frame with columns containing comma-separated values (e.g., "1.2,3.4")
ARRAY: a 3-dimensional array with dim[3] = 2 (min/max slices)
SODAS: character string ending with ".xml" (file path)
SDS: alias for SODAS
A character string indicating the detected format: "RSDA", "MM", "iGAP", "ARRAY", "SODAS", or "unknown"
data(mushroom.int)
int_detect_format(mushroom.int) # Should return "RSDA"
data(abalone.iGAP)
int_detect_format(abalone.iGAP) # Should return "iGAP"
# ARRAY format
x <- array(1:24, dim = c(4, 3, 2))
int_detect_format(x) # Should return "ARRAY"
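The detection rules listed above can be read as a chain of structural checks. The sketch below encodes them in base R as an illustration only; `detect_format_sketch` is a hypothetical name, the order of the checks is an assumption, and the package may apply stricter tests (for example on how RSDA interval columns are stored).

```r
# Hypothetical re-statement of the documented detection rules.
detect_format_sketch <- function(x) {
  # SODAS: a single file path ending in ".xml".
  if (is.character(x) && length(x) == 1 &&
      grepl("\\.xml$", x, ignore.case = TRUE)) {
    return("SODAS")
  }
  # ARRAY: three dimensions, with min/max slices along the third.
  if (is.array(x) && length(dim(x)) == 3 && dim(x)[3] == 2) {
    return("ARRAY")
  }
  if (is.data.frame(x)) {
    # RSDA: symbolic_tbl class plus at least one complex column.
    has_complex <- any(vapply(x, is.complex, logical(1)))
    if (inherits(x, "symbolic_tbl") && has_complex) return("RSDA")
    # MM: every "_min" column has a matching "_max" column.
    nms  <- names(x)
    mins <- sub("_min$", "", nms[grepl("_min$", nms)])
    maxs <- sub("_max$", "", nms[grepl("_max$", nms)])
    if (length(mins) > 0 && all(mins %in% maxs)) return("MM")
    # iGAP: some character column contains comma-separated values.
    has_comma <- vapply(
      x,
      function(col) is.character(col) && any(grepl(",", col, fixed = TRUE)),
      logical(1)
    )
    if (any(has_comma)) return("iGAP")
  }
  "unknown"
}

detect_format_sketch(array(1:24, dim = c(4, 3, 2)))
detect_format_sketch(data.frame(a_min = 1, a_max = 2))
detect_format_sketch(data.frame(a = "1.2,3.4", stringsAsFactors = FALSE))
```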
List all available format conversion functions.
int_list_conversions(from = NULL, to = NULL)
| from | source format (optional): "RSDA", "MM", "iGAP", "ARRAY", "SODAS" |
| to | target format (optional): "RSDA", "MM", "iGAP", "ARRAY", "SODAS" |
A data.frame showing available conversions
# List all conversions
int_list_conversions()
# List conversions from RSDA
int_list_conversions(from = "RSDA")
# List conversions to MM
int_list_conversions(to = "MM")
Functions to compute various distance measures between interval-valued observations.
int_dist_all computes all available distance measures at once.
int_dist(x, method = "euclidean", gamma = 0.5, q = 1, p = 2, ...)
int_dist_matrix(x, method = "euclidean", gamma = 0.5, q = 1, p = 2, ...)
int_pairwise_dist(x, var_name1, var_name2, method = "euclidean", ...)
int_dist_all(x, gamma = 0.5, q = 1)
| x | interval-valued data with symbolic_tbl class, or an array of dimension [n, p, 2] |
| method | distance method: "GD", "IY", "L1", "L2", "CB", "HD", "EHD", "nEHD", "snEHD", "TD", "WD", "euclidean", "hausdorff", "manhattan", "city_block", "minkowski", "wasserstein", "ichino", "de_carvalho" |
| gamma | parameter for the Ichino-Yaguchi distance, 0 <= gamma <= 0.5 (default: 0.5) |
| q | parameter for the Ichino-Yaguchi distance (Minkowski exponent) (default: 1) |
| p | power parameter for Minkowski distance (default: 2) |
| ... | additional parameters |
| var_name1 | first variable name or column location |
| var_name2 | second variable name or column location |
Available distance methods:
GD: Gowda-Diday distance (Gowda & Diday, 1991)
IY: Ichino-Yaguchi distance (Ichino, 1988)
L1: L1 (midpoint Manhattan) distance
L2: L2 (Euclidean midpoint) distance
CB: City-Block distance (Souza & de Carvalho, 2004)
HD: Hausdorff distance (Chavent & Lechevallier, 2002)
EHD: Euclidean Hausdorff distance
nEHD: Normalized Euclidean Hausdorff distance
snEHD: Span Normalized Euclidean Hausdorff distance
TD: Tran-Duckstein distance (Tran & Duckstein, 2002)
WD: L2-Wasserstein distance (Verde & Irpino, 2008)
euclidean: Euclidean distance on interval centers (same as L2)
hausdorff: Hausdorff distance (same as HD)
manhattan: Manhattan distance (same as L1)
city_block: City-block distance (same as CB)
minkowski: Minkowski distance with parameter p
wasserstein: Wasserstein distance (same as WD)
ichino: Ichino-Yaguchi distance (simplified version)
de_carvalho: De Carvalho distance
A distance matrix (class 'dist') or numeric vector
Han-Ming Wu
Gowda, K. C., & Diday, E. (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24(6), 567-578.
Ichino, M. (1988). General metrics for mixed features. Systems and Computers in Japan, 19(2), 37-50.
Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data. In Classification, Clustering and Data Analysis (pp. 53-60). Springer.
Tran, L., & Duckstein, L. (2002). Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets and Systems, 130, 331-341.
Verde, R., & Irpino, A. (2008). A new interval data distance based on the Wasserstein metric.
Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.
int_dist_matrix, int_dist_all, int_pairwise_dist
# Using symbolic_tbl format
data(mushroom.int)
d1 <- int_dist(mushroom.int[, 3:4], method = "euclidean")
d2 <- int_dist(mushroom.int[, 3:4], method = "hausdorff")
d3 <- int_dist(mushroom.int[, 3:4], method = "GD")
# Using array format: 4 concepts, 3 variables
x <- array(NA, dim = c(4, 3, 2))
x[,,1] <- matrix(c(1,2,3,4, 5,6,7,8, 9,10,11,12), nrow=4)
x[,,2] <- matrix(c(3,5,6,7, 8,9,10,12, 13,15,16,18), nrow=4)
d4 <- int_dist(x, method = "snEHD")
d5 <- int_dist(x, method = "IY", gamma = 0.3)
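As a rough illustration of two of the listed measures, the base-R sketch below computes a Hausdorff-type distance and a Euclidean distance on interval centers for an [n, p, 2] array. The helper names are hypothetical, and summing the per-variable Hausdorff terms is an assumption; the package may aggregate or normalize differently.

```r
# Array format: 4 observations, 3 variables, slices 1 = lower, 2 = upper.
x <- array(NA, dim = c(4, 3, 2))
x[, , 1] <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4)
x[, , 2] <- matrix(c(3, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 18), nrow = 4)

# Hausdorff distance between intervals [a1, b1] and [a2, b2] is
# max(|a1 - a2|, |b1 - b2|); here the per-variable terms are summed.
hausdorff_sketch <- function(x) {
  n <- dim(x)[1]
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (k in seq_len(n)) {
      per_var <- pmax(abs(x[i, , 1] - x[k, , 1]),
                      abs(x[i, , 2] - x[k, , 2]))
      d[i, k] <- sum(per_var)
    }
  }
  as.dist(d)
}

# Euclidean distance computed on interval centers only.
center_euclidean_sketch <- function(x) {
  centers <- (x[, , 1] + x[, , 2]) / 2
  dist(centers, method = "euclidean")
}

hd <- hausdorff_sketch(x)
ce <- center_euclidean_sketch(x)
hd
ce
```

The center-based distance ignores interval widths entirely, whereas the Hausdorff form reacts to a shift in either bound.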
Functions to compute geometric characteristics of interval-valued data.
int_width(x, var_name, ...)
int_radius(x, var_name, ...)
int_center(x, var_name, ...)
int_overlap(x, var_name1, var_name2, ...)
int_containment(x, var_name1, var_name2, ...)
int_midrange(x, var_name, ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| ... | additional parameters |
| var_name1 | the first variable name or column location. |
| var_name2 | the second variable name or column location. |
These functions compute basic geometric properties:
int_width: Width of each interval (upper - lower)
int_radius: Radius of each interval (width / 2)
int_center: Center point of each interval ((lower + upper) / 2)
int_overlap: Overlap measure between two interval variables
int_containment: Check if one interval contains another
int_midrange: Half-range of each interval ((upper - lower) / 2)
A numeric matrix or value
Han-Ming Wu
int_width, int_radius, int_center, int_overlap
data(mushroom.int)
# Calculate interval widths
int_width(mushroom.int, var_name = "Pileus.Cap.Width")
int_width(mushroom.int, var_name = 2:3)
# Calculate interval radius
int_radius(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))
# Get interval centers
int_center(mushroom.int, var_name = 2:4)
# Measure overlap between two variables
int_overlap(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")
# Check containment
int_containment(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")
# Calculate midrange
int_midrange(mushroom.int, var_name = 2:3)
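The width, radius, and center formulas given above are simple enough to check by hand. This base-R sketch applies them to plain lower/upper vectors; the overlap and containment helpers are hypothetical, and using the length of the intersection as the overlap measure is an assumption, since the text does not say which measure the package uses.

```r
# Three intervals given by their lower and upper bounds.
lower <- c(1, 4, 2)
upper <- c(3, 10, 2.5)

width  <- upper - lower          # upper - lower
radius <- width / 2              # width / 2
center <- (lower + upper) / 2    # (lower + upper) / 2

# Hypothetical overlap measure: length of the intersection (0 if disjoint).
overlap_len <- function(a_lo, a_hi, b_lo, b_hi) {
  pmax(0, pmin(a_hi, b_hi) - pmax(a_lo, b_lo))
}

# Hypothetical containment check: does [a_lo, a_hi] contain [b_lo, b_hi]?
contains <- function(a_lo, a_hi, b_lo, b_hi) {
  a_lo <= b_lo & b_hi <= a_hi
}

overlap_len(1, 3, 2, 5)
contains(0, 10, 2, 5)
```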
Functions to compute position and scale statistics for interval-valued data.
int_median(x, var_name, method = "CM", ...)
int_quantile(x, var_name, probs = c(0.25, 0.5, 0.75), method = "CM", ...)
int_range(x, var_name, method = "CM", ...)
int_iqr(x, var_name, method = "CM", ...)
int_mad(x, var_name, method = "CM", ...)
int_mode(x, var_name, method = "CM", breaks = 30, ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| method | methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT. |
| ... | additional parameters |
| probs | numeric vector of probabilities with values in [0,1]. |
| breaks | number of histogram breaks for mode estimation (default: 30). |
These functions provide position and scale measures:
int_median: Median of interval data
int_quantile: Quantiles of interval data
int_range: Range (max - min) of interval data
int_iqr: Interquartile range (Q3 - Q1)
int_mad: Median absolute deviation
int_mode: Mode of interval data (estimated via histogram)
A numeric matrix or value
Han-Ming Wu
int_mean, int_var, int_median, int_quantile
data(mushroom.int)
# Calculate median
int_median(mushroom.int, var_name = "Pileus.Cap.Width")
int_median(mushroom.int, var_name = 2:3, method = c("CM", "EJD"))
# Calculate quantiles
int_quantile(mushroom.int, var_name = 2, probs = c(0.25, 0.5, 0.75))
# Calculate interquartile range
int_iqr(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))
# Calculate range
int_range(mushroom.int, var_name = "Pileus.Cap.Width")
# Calculate MAD
int_mad(mushroom.int, var_name = 2:3, method = "CM")
# Estimate mode
int_mode(mushroom.int, var_name = "Stipe.Length", method = "CM")
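One way to read the default CM (center) method is that each interval is replaced by its midpoint and ordinary statistics are applied to those midpoints. The base-R sketch below follows that reading on made-up bounds; it is an assumption about the approach, not a reproduction of the package's output.

```r
# Five intervals given by their bounds (made-up values).
lower <- c(1, 4, 2, 6, 3)
upper <- c(3, 10, 4, 8, 9)

# Center method: reduce each interval to its midpoint.
centers <- (lower + upper) / 2

med <- median(centers)
qs  <- quantile(centers, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(centers)               # Q3 - Q1
rng <- diff(range(centers))       # max - min
md  <- mad(centers)               # base R scales by 1.4826 by default

list(median = med, quantiles = qs, iqr = iqr, range = rng, mad = md)
```

Note that base R's `mad` applies a consistency constant by default, so values may differ from an unscaled median absolute deviation.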
Functions to compute robust statistics for interval-valued data.
int_trimmed_mean(x, var_name, trim = 0.1, method = "CM", ...)
int_winsorized_mean(x, var_name, trim = 0.1, method = "CM", ...)
int_trimmed_var(x, var_name, trim = 0.1, method = "CM", ...)
int_winsorized_var(x, var_name, trim = 0.1, method = "CM", ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| trim | the fraction (0 to 0.5) of observations to be trimmed from each end. |
| method | methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT. |
| ... | additional parameters |
These functions provide robust alternatives to standard statistics:
int_trimmed_mean: Mean after trimming extreme values
int_winsorized_mean: Mean after winsorizing extreme values
int_trimmed_var: Variance after trimming extreme values
int_winsorized_var: Variance after winsorizing extreme values
Trimming vs Winsorizing:
Trimming: Remove extreme values
Winsorizing: Replace extreme values with less extreme values
A numeric matrix
Han-Ming Wu
int_mean, int_var, int_trimmed_mean
data(mushroom.int)
# Trimmed mean (10% from each end)
int_trimmed_mean(mushroom.int, var_name = "Pileus.Cap.Width", trim = 0.1)
# Winsorized mean
int_winsorized_mean(mushroom.int, var_name = 2:3, trim = 0.05, method = "CM")
# Trimmed variance
int_trimmed_var(mushroom.int, var_name = c("Stipe.Length"), trim = 0.1)
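The difference between trimming and winsorizing is easiest to see on a small vector with one outlier. The sketch below works on a plain numeric vector standing in for interval centers; the `winsorize` helper is hypothetical, and how the package handles the trimmed count is not stated in the text.

```r
# Stand-in for interval centers, with one extreme value.
centers <- c(1, 2, 3, 4, 5, 6, 7, 8, 20, 100)
trim <- 0.1

# Trimming: drop the lowest and highest 10% before averaging.
trimmed_mean <- mean(centers, trim = trim)

# Winsorizing: replace the k lowest/highest values with their nearest
# retained neighbours instead of dropping them.
winsorize <- function(v, trim) {
  n <- length(v)
  k <- floor(n * trim)
  s <- sort(v)
  if (k > 0) {
    s[seq_len(k)] <- s[k + 1]
    s[(n - k + 1):n] <- s[n - k]
  }
  s
}
winsorized_mean <- mean(winsorize(centers, trim))

c(plain = mean(centers), trimmed = trimmed_mean, winsorized = winsorized_mean)
```

Both robust versions sit well below the plain mean because the outlier no longer enters at full size; the winsorized mean keeps the sample size at ten, while the trimmed mean averages eight values.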
Functions to compute shape statistics (skewness, kurtosis) for interval-valued data.
int_skewness(x, var_name, method = "CM", ...)
int_kurtosis(x, var_name, method = "CM", ...)
int_symmetry(x, var_name, method = "CM", ...)
int_tailedness(x, var_name, method = "CM", ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| method | methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT. |
| ... | additional parameters |
These functions measure distribution shape:
int_skewness: Measure of asymmetry (skewness)
int_kurtosis: Measure of tail heaviness (kurtosis)
int_symmetry: Symmetry coefficient
int_tailedness: Tailedness measure (alias for excess kurtosis)
Skewness interpretation:
= 0: Symmetric distribution
> 0: Right-skewed (positive skew)
< 0: Left-skewed (negative skew)
Kurtosis interpretation (excess kurtosis):
= 0: Normal distribution (mesokurtic)
> 0: Heavy tails (leptokurtic)
< 0: Light tails (platykurtic)
A numeric matrix
Han-Ming Wu
int_mean, int_var, int_skewness, int_kurtosis
data(mushroom.int)
# Calculate skewness
int_skewness(mushroom.int, var_name = "Pileus.Cap.Width")
int_skewness(mushroom.int, var_name = 2:3, method = c("CM", "EJD"))
# Calculate kurtosis
int_kurtosis(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))
# Check symmetry
int_symmetry(mushroom.int, var_name = 2:4, method = "CM")
# Check tailedness
int_tailedness(mushroom.int, var_name = "Pileus.Cap.Width", method = "CM")
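The sign conventions above can be verified with the usual moment-based formulas. This base-R sketch computes skewness and excess kurtosis on interval centers using population (1/n) moments; the helper name `moment_shape` is hypothetical, and the package may use a different estimator or bias correction.

```r
# Moment-based skewness and excess kurtosis with 1/n central moments.
moment_shape <- function(v) {
  m  <- mean(v)
  m2 <- mean((v - m)^2)
  m3 <- mean((v - m)^3)
  m4 <- mean((v - m)^4)
  c(skewness = m3 / m2^1.5, excess_kurtosis = m4 / m2^2 - 3)
}

# Intervals whose centers have a long right tail.
lower <- c(1, 2, 3, 4, 5)
upper <- c(3, 4, 5, 6, 17)
centers <- (lower + upper) / 2

shape <- moment_shape(centers)
shape
```

A positive skewness here matches the "right-skewed" reading in the interpretation list, since one center lies far above the rest.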
Functions to compute similarity measures between interval-valued observations.
int_jaccard(x, var_name1, var_name2, ...)
int_dice(x, var_name1, var_name2, ...)
int_cosine(x, var_name1, var_name2, ...)
int_overlap_coefficient(x, var_name1, var_name2, ...)
int_tanimoto(x, var_name1, var_name2, ...)
int_similarity_matrix(x, method = "jaccard", ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name1 | the first variable name or column location. |
| var_name2 | the second variable name or column location. |
| ... | additional parameters |
| method | similarity method for int_similarity_matrix: "jaccard", "dice", or "overlap". |
These functions compute various similarity measures:
int_jaccard: Jaccard similarity coefficient
int_dice: Dice similarity coefficient
int_cosine: Cosine similarity
int_overlap_coefficient: Overlap coefficient
int_tanimoto: Tanimoto coefficient (generalized Jaccard)
int_similarity_matrix: Pairwise similarity matrix across all observations
All similarity measures range from 0 (no similarity) to 1 (perfect similarity).
A numeric matrix or value
Han-Ming Wu
int_dist, int_cor, int_jaccard
data(mushroom.int)
# Jaccard similarity
int_jaccard(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")
# Dice coefficient
int_dice(mushroom.int, 2, 3)
# Cosine similarity
int_cosine(mushroom.int, var_name1 = c("Pileus.Cap.Width"),
           var_name2 = c("Stipe.Length", "Stipe.Thickness"))
# Overlap coefficient
int_overlap_coefficient(mushroom.int, 2, 3:4)
# Tanimoto coefficient
int_tanimoto(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")
# Similarity matrix across all observations
int_similarity_matrix(mushroom.int, method = "jaccard")
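For intervals, set-based similarity coefficients are commonly defined through the lengths of the intersection and union. The base-R sketch below uses those textbook definitions for the Jaccard, Dice, and overlap coefficients; `interval_similarity` is a hypothetical helper, and the package's formulas may differ, especially for zero-width intervals, which would give a division by zero here.

```r
# Similarity between intervals [a_lo, a_hi] and [b_lo, b_hi],
# vectorised over observations.
interval_similarity <- function(a_lo, a_hi, b_lo, b_hi) {
  inter <- pmax(0, pmin(a_hi, b_hi) - pmax(a_lo, b_lo))  # intersection length
  len_a <- a_hi - a_lo
  len_b <- b_hi - b_lo
  union <- len_a + len_b - inter                          # union length
  data.frame(
    jaccard = inter / union,
    dice    = 2 * inter / (len_a + len_b),
    overlap = inter / pmin(len_a, len_b)
  )
}

# Row 1: [1, 5] vs [3, 7] overlap on [3, 5]; row 2: [0, 2] vs [3, 4] are disjoint.
sim <- interval_similarity(a_lo = c(1, 0), a_hi = c(5, 2),
                           b_lo = c(3, 3), b_hi = c(7, 4))
sim
```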
Functions to compute the mean, variance, covariance, and correlation of interval-valued data.
int_mean(x, var_name, method = "CM", ...) int_var(x, var_name, method = "CM", ...) int_cov(x, var_name1, var_name2, method = "CM", ...) int_cor(x, var_name1, var_name2, method = "CM", ...)int_mean(x, var_name, method = "CM", ...) int_var(x, var_name, method = "CM", ...) int_cov(x, var_name1, var_name2, method = "CM", ...) int_cor(x, var_name1, var_name2, method = "CM", ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| method | methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT. |
| ... | additional parameters |
| var_name1 | the variable name or the column location (multiple variables are allowed). |
| var_name2 | the variable name or the column location (multiple variables are allowed). |
Available methods (applicable to all four functions):
CM: Center Method — uses midpoints (a + b) / 2
VM: Vertices Method — uses all 2^p vertex combinations
QM: Quantiles Method — uses equally spaced quantile points
SE: Set Expansion — uses endpoints only (quantiles with m = 1)
FV: Fitted Values — uses linear regression fitted values
EJD: Empirical Joint Distribution
GQ: Symbolic Covariance method (Billard and Diday, 2006)
SPT: Total Sum of Products (Billard, 2008)
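As an illustration of the Center Method entry in the list above, the following base R sketch reduces each interval to its midpoint and applies the ordinary sample statistics. The data are made up, and whether the package divides by n or n - 1 is not stated here, so the denominators below (those of `var()` and `cov()`) are an assumption.

```r
# Made-up interval data for two variables (not taken from any package dataset)
x_lower <- c(1, 2, 4); x_upper <- c(3, 6, 8)
y_lower <- c(0, 1, 5); y_upper <- c(2, 5, 7)

# Center Method idea: replace each interval [a, b] by its midpoint (a + b) / 2
x_mid <- (x_lower + x_upper) / 2   # 2, 4, 6
y_mid <- (y_lower + y_upper) / 2   # 1, 3, 6

cm_mean <- mean(x_mid)             # 4
cm_var  <- var(x_mid)              # 4 (n - 1 denominator, an assumption)
cm_cov  <- cov(x_mid, y_mid)       # 5
cm_cor  <- cor(x_mid, y_mid)
```

Because only midpoints are used, this approach ignores interval widths; the other listed methods differ in how much of the within-interval variation they take into account.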
A numeric matrix for int_mean and int_var (methods x variables);
a named list of covariance/correlation matrices for int_cov and int_cor
(one matrix per method).
Han-Ming Wu
int_mean int_var int_cov int_cor
data(mushroom.int)
int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
int_mean(mushroom.int, var_name = 2:3)
var_name <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "FV", "EJD")
int_mean(mushroom.int, var_name, method)
int_var(mushroom.int, var_name, method)
var_name1 <- "Pileus.Cap.Width"
var_name2 <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "VM", "EJD", "GQ", "SPT")
int_cov(mushroom.int, var_name1, var_name2, method)
int_cor(mushroom.int, var_name1, var_name2, method)
Functions to compute uncertainty and variability measures for interval-valued data.
int_entropy(x, var_name, method = "CM", base = 2, ...)
int_cv(x, var_name, method = "CM", ...)
int_dispersion(x, var_name, method = "CM", ...)
int_imprecision(x, var_name, ...)
int_granularity(x, var_name, ...)
int_uniformity(x, var_name, ...)
int_information_content(x, var_name, method = "CM", ...)
| x | interval-valued data with symbolic_tbl class. |
| var_name | the variable name or the column location (multiple variables are allowed). |
| method | methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT. |
| base | logarithm base for entropy calculation (default: 2) |
| ... | additional parameters |
These functions measure uncertainty and variability:
int_entropy: Shannon entropy (information content)
int_cv: Coefficient of variation (CV = SD / Mean)
int_dispersion: General dispersion index
int_imprecision: Imprecision based on interval width
int_granularity: Variability in interval sizes
int_uniformity: Uniformity of interval widths (inverse of granularity)
int_information_content: Normalized entropy (entropy / log2(n))
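The quantities named in this list can be illustrated with textbook formulas in base R. This is only a sketch on made-up intervals: the helper `shannon_entropy` is hypothetical, and how the package turns intervals into a probability vector, or which value of n it uses for normalization, is not specified here and is assumed below.

```r
# Made-up intervals; nothing here calls the package.
lower <- c(1, 2, 4)
upper <- c(3, 6, 8)
mid   <- (lower + upper) / 2          # 2, 4, 6
width <- upper - lower                # 2, 4, 4

cv_mid     <- sd(mid) / mean(mid)     # CV = SD / Mean = 2 / 4 = 0.5
mean_width <- mean(width)             # one simple width-based summary

# Shannon entropy of a probability vector, with a selectable log base
shannon_entropy <- function(p, base = 2) {
  p <- p[p > 0]                       # treat 0 * log(0) as 0
  -sum(p * log(p, base = base))
}
p      <- c(0.5, 0.25, 0.25)          # an assumed probability vector
H      <- shannon_entropy(p)          # 1.5 bits
H_norm <- H / log2(length(p))         # entropy / log2(n), with n = length(p) assumed
```

Normalizing by log2(n) rescales the entropy to at most 1, which makes values comparable across variables with different numbers of categories or bins.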
A numeric matrix or value
Han-Ming Wu
int_var int_entropy int_cv
data(mushroom.int)
# Calculate entropy
int_entropy(mushroom.int, var_name = "Pileus.Cap.Width")
# Coefficient of variation
int_cv(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"),
       method = c("CM", "EJD"))
# Measure imprecision
int_imprecision(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))
# Dispersion index
int_dispersion(mushroom.int, var_name = "Pileus.Cap.Width", method = "CM")
# Check data granularity
int_granularity(mushroom.int, var_name = 2:4)
# Check uniformity
int_uniformity(mushroom.int, var_name = 2:3)
# Information content
int_information_content(mushroom.int, var_name = "Stipe.Length", method = "CM")
Histogram-valued dataset of 3 iris species (Versicolor, Virginica, Setosa) with 4 histogram-valued morphological variables and a species label. Each histogram describes the distribution of measurements within a species.
data(iris_species.hist)
A data frame with 3 observations and 5 variables:
species: Species name (factor: Versicolor, Virginica, Setosa).
sepal_width: Histogram-valued sepal width distribution.
sepal_length: Histogram-valued sepal length distribution.
petal_width: Histogram-valued petal width distribution.
petal_length: Histogram-valued petal length distribution.
Row names are species names.
| Sample size (n) | 3 |
| Variables (p) | 5 |
| Subject area | Botany |
| Symbolic format | Histogram |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2020), Table 4-10.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 4-10.
data(iris_species.hist)
Interval-valued version of the classic iris dataset, aggregated from Fisher's iris data into 30 interval observations across 3 species (Setosa, Versicolor, Virginica). Each observation represents a group of flowers with ranges for sepal and petal measurements.
data(iris.int)
A data frame with 30 observations and 5 variables:
sepal_length: Sepal length range (cm).
sepal_width: Sepal width range (cm).
petal_length: Petal length range (cm).
petal_width: Petal width range (cm).
class: Species (Setosa, Versicolor, Virginica).
| Sample size (n) | 30 |
| Variables (p) | 5 |
| Subject area | Botany |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(iris.int)
Monthly interval-valued wind speed data at 5 meteorological stations in Ireland from January 1961 to December 1978 (216 months). For each month and station, the interval is defined as [minimum daily average wind speed, maximum daily average wind speed] across all days in that month.
data(irish_wind.its)
A data frame with 216 observations and 11 columns (5 interval
variables in _l/_u Min-Max pairs, plus a date):
date: First day of the month (Date class).
BIR_l, BIR_u: Monthly [min, max] daily wind speed
at Birr (knots).
DUB_l, DUB_u: Monthly [min, max] daily wind speed
at Dublin Airport (knots).
KIL_l, KIL_u: Monthly [min, max] daily wind speed
at Kilkenny (knots).
SHA_l, SHA_u: Monthly [min, max] daily wind speed
at Shannon Airport (knots).
VAL_l, VAL_u: Monthly [min, max] daily wind speed
at Valentia Observatory (knots).
The original data contains daily average wind speeds (in knots) at 12 synoptic meteorological stations in the Republic of Ireland, collected by the Irish Meteorological Service. This is the classic Haslett and Raftery (1989) dataset, one of the most widely used benchmarks in spatial statistics. Following the approach of Teles and Brito (2015), the raw daily data is aggregated to monthly intervals for 5 selected stations: Birr (BIR), Dublin Airport (DUB), Kilkenny (KIL), Shannon Airport (SHA), and Valentia Observatory (VAL). Each monthly interval captures the range of daily wind variability within that month.
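The daily-to-monthly aggregation described above can be sketched in base R as follows. The daily series here is synthetic, and the column names `VAL_l` and `VAL_u` are reused only to mirror the dataset layout; this is not the script that produced the packaged data.

```r
# Synthetic daily wind speeds for three months (illustrative values only)
dates <- seq(as.Date("1961-01-01"), as.Date("1961-03-31"), by = "day")
speed <- 10 + 5 * sin(seq_along(dates) / 7)

# Group days by calendar month and keep the monthly minimum and maximum
month <- format(dates, "%Y-%m")
lo <- tapply(speed, month, min)
hi <- tapply(speed, month, max)

monthly <- data.frame(
  date  = as.Date(paste0(names(lo), "-01")),  # first day of each month
  VAL_l = as.numeric(lo),
  VAL_u = as.numeric(hi)
)
```

Each row of `monthly` is one interval observation, with the lower bound never exceeding the upper bound by construction.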
| Sample size (n) | 216 |
| Variables (p) | 11 |
| Subject area | Meteorology |
| Symbolic format | Interval time series (multivariate) |
| Analytical tasks | Space-time modelling, Forecasting, Clustering |
Derived from the wind dataset in the gstat R
package (originally from Haslett and Raftery, 1989). Daily data
aggregated to monthly intervals.
Haslett, J. and Raftery, A. E. (1989). Space-time modelling with long-memory dependence: Assessing Ireland's wind power resource. Journal of the Royal Statistical Society, Series C (Applied Statistics), 38(1), 1–50.
Teles, P. and Brito, P. (2015). Modeling interval time series with space-time processes. Communications in Statistics – Theory and Methods, 44(17), 3599–3619.
data(irish_wind.its)
head(irish_wind.its)
# Plot Valentia Observatory wind speed interval
# (ylim spans both bounds so the lower series is not clipped)
plot(irish_wind.its$date, irish_wind.its$VAL_u, type = "l", col = "red",
     ylim = range(irish_wind.its$VAL_l, irish_wind.its$VAL_u, na.rm = TRUE),
     ylab = "Wind speed (knots)", xlab = "Date",
     main = "Valentia Observatory Monthly Wind Speed Interval")
lines(irish_wind.its$date, irish_wind.its$VAL_l, col = "blue")
legend("topright", c("Max", "Min"), col = c("red", "blue"), lty = 1)
Mixed symbolic dataset of 10 jogger groups with one interval-valued variable (pulse rate) and one histogram-valued variable (running time distribution).
data(joggers.mix)
A symbolic data frame (symbolic_tbl) with 10 observations
(jogger groups) and 2 variables:
pulse_rate: Interval-valued resting pulse rate range (bpm).
running_time: Histogram-valued distribution of running
times (minutes).
Row names are Group_1 through Group_10.
| Sample size (n) | 10 |
| Variables (p) | 2 |
| Subject area | Sports |
| Symbolic format | Mixed (interval, histogram) |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 2-5.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 2-5.
data(joggers.mix)
Interval-valued ratings from Judge 1 for 6 regions on 4 variables. From a study of generalized principal component analysis for interval-valued data (GPCSIV).
data(judge1.int)
A symbolic data frame (symbolic_tbl) with 6 observations
and 4 interval-valued variables (V1–V4).
| Sample size (n) | 6 |
| Variables (p) | 4 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (Judge1 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (Judge1 dataset).
data(judge1.int)
Interval-valued ratings from Judge 2 for 6 regions on 4 variables. From a study of generalized principal component analysis for interval-valued data (GPCSIV).
data(judge2.int)
A symbolic data frame (symbolic_tbl) with 6 observations
and 4 interval-valued variables (V1–V4).
| Sample size (n) | 6 |
| Variables (p) | 4 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (Judge2 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (Judge2 dataset).
data(judge2.int)
Interval-valued ratings from Judge 3 for 6 regions on 4 variables. From a study of generalized principal component analysis for interval-valued data (GPCSIV).
data(judge3.int)
A symbolic data frame (symbolic_tbl) with 6 observations
and 4 interval-valued variables (V1–V4).
| Sample size (n) | 6 |
| Variables (p) | 4 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (Judge3 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (Judge3 dataset).
data(judge3.int)
Interval-valued dataset from a lack-of-information questionnaire. Contains biographical data and responses to 5 items measuring perception of lack of information, collected via an interval-valued Likert scale.
data(lackinfo.int)
A data frame with 50 observations and 8 variables:
id: Identification number.
sex: Sex of the respondent (male or female).
age: Respondent's age (in years).
item1: Interval-valued answer to item 1.
item2: Interval-valued answer to item 2.
item3: Interval-valued answer to item 3.
item4: Interval-valued answer to item 4.
item5: Interval-valued answer to item 5.
An educational innovation project was carried out for improving teaching-learning processes at the University of Oviedo (Spain) for the 2020/2021 academic year. A total of 50 students answered an online questionnaire about biographical data (sex and age) and their perception of lack of information by selecting the interval that best represents their level of agreement on a scale bounded between 1 (strongly disagree) and 7 (strongly agree).
The 5 items measuring perception of lack of information are:
I1: I receive too little information from my classmates.
I2: It is difficult to receive relevant information from my classmates.
I3: It is difficult to receive relevant information from the teacher.
I4: The amount of information I receive from my classmates is very low.
I5: The amount of information I receive from the teacher is very low.
| Sample size (n) | 50 |
| Variables (p) | 8 |
| Subject area | Education |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Regression |
https://CRAN.R-project.org/package=IntervalQuestionStat
data(lackinfo.int)
Interval-valued daily air quality data from the Entrecampos monitoring
station in Lisbon, Portugal, covering 2019–2021 (1096 days). Each day's
pollutant concentration is represented as an interval derived
from hourly measurements. Missing days are imputed via linear
interpolation.
data(lisbon_air_quality.int)
A symbolic data frame (symbolic_tbl) with 1096 observations
(daily) and 8 interval-valued pollutant variables:
so2: Sulphur dioxide (ug/m3).
pm10: Particulate matter < 10 um (ug/m3).
o3: Ozone (ug/m3).
no2: Nitrogen dioxide (ug/m3).
co: Carbon monoxide (ug/m3).
pm25: Particulate matter < 2.5 um (ug/m3).
nox: Nitrogen oxides (ug/m3).
no: Nitric oxide (ug/m3).
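The description mentions that missing days are imputed via linear interpolation. A minimal base R sketch of that technique, applied to one made-up bound series, is shown below; whether the original processing interpolated lower and upper bounds separately, or used this exact routine, is an assumption.

```r
# Made-up daily lower bounds with gaps (NA marks a missing day)
day   <- 1:6
lower <- c(10, NA, 14, NA, NA, 20)

# Linear interpolation between the nearest observed days
observed <- !is.na(lower)
filled <- approx(x = day[observed], y = lower[observed], xout = day)$y
# filled is 10, 12, 14, 16, 18, 20
```

`approx()` only interpolates inside the observed range by default, so gaps at the very start or end of a series would remain `NA` unless handled separately.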
| Sample size (n) | 1096 |
| Variables (p) | 8 |
| Subject area | Environment |
| Symbolic format | Interval |
| Analytical tasks | Regression, Time series |
QualAr, Entrecampos station, Lisbon, Portugal.
Dias, S. and Brito, P. (2017). Off the beaten track: A new linear model for interval data. European Journal of Operational Research, 258(3), 1118–1130.
Data from the QualAr Portuguese air quality monitoring network (https://qualar.apambiente.pt/).
data(lisbon_air_quality.int)
Interval-valued data for loan characteristics aggregated by their purpose. Original microdata contains 887,383 loan records from Kaggle.
data(loans_by_purpose.int)
A data frame with 14 observations and 4 interval-valued variables:
ln_inc: Natural logarithm of self-reported annual income.
ln_revolbal: Natural logarithm of total credit revolving balance.
open_acc: Number of open credit lines.
total_acc: Total number of credit lines.
| Sample size (n) | 14 |
| Variables (p) | 4 |
| Subject area | Finance |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Clustering |
https://CRAN.R-project.org/package=MAINT.Data
data(loans_by_purpose.int)
Interval-valued dataset of 35 Lending Club loan groups stratified by risk level (A1–G5). Intervals represent the 10th to 90th percentile range of each financial variable within each risk subgrade.
data(loans_by_risk_quantile.int)
A symbolic data frame (symbolic_tbl) with 35 observations
and 4 variables:
ln-inc: Interval-valued log income.
int-rate: Interval-valued interest rate.
open-acc: Interval-valued number of open accounts.
total-acc: Interval-valued total accounts.
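Intervals of the kind described for this dataset, running from the 10th to the 90th percentile within each group, can be built from microdata roughly as follows. The values and group labels are synthetic, and the default quantile definition of `quantile()` (type 7) is an assumption, since the source does not say which definition was used.

```r
# Synthetic microdata: one numeric variable observed in two risk subgrades
micro <- data.frame(
  subgrade = rep(c("A1", "A2"), each = 101),
  value    = c(0:100, 100:200)
)

# 10th and 90th percentiles per group form the interval bounds
bounds <- tapply(micro$value, micro$subgrade,
                 function(v) quantile(v, probs = c(0.1, 0.9)))
q_int <- data.frame(
  subgrade = names(bounds),
  lower    = sapply(bounds, `[`, 1),
  upper    = sapply(bounds, `[`, 2),
  row.names = NULL
)
# A1: [10, 90], A2: [110, 190]
```

Compared with min-max intervals, percentile-based bounds are less sensitive to a few extreme records in each group.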
| Sample size (n) | 35 |
| Variables (p) | 4 |
| Subject area | Finance |
| Symbolic format | Interval |
| Analytical tasks | Classification, Clustering |
MAINT.Data R package (LoansbyRiskLvs_qntlDt dataset).
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics, 39(1), 3–20.
Original data from the MAINT.Data R package
(LoansbyRiskLvs_qntlDt dataset).
data(loans_by_risk_quantile.int)
Interval-valued dataset of 35 Lending Club loan groups classified by risk level (A through G, 5 groups each). Each group is described by 4 interval-valued financial variables.
data(loans_by_risk.int)
A symbolic data frame (symbolic_tbl) with 35 observations
and 5 variables:
log_income: Interval-valued log annual income.
interest_rate: Interval-valued interest rate (%).
open_accounts: Interval-valued number of open credit accounts.
total_accounts: Interval-valued total number of credit accounts.
risk_level: Risk grade factor (A, B, C, D, E, F, G).
Row names are A1–A5, B1–B5, ..., G1–G5.
| Sample size (n) | 35 |
| Variables (p) | 5 |
| Subject area | Finance |
| Symbolic format | Interval |
| Analytical tasks | Classification, Clustering |
MAINT.Data R package (LoansbyRisk_minmax dataset).
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics, 39(1), 3–20.
Original data from the MAINT.Data R package.
data(loans_by_risk.int)
Histogram-valued distribution of lung cancer treatment counts for 2 US states (Massachusetts and New York).
data(lung_cancer.hist)
A data frame with 2 observations and 2 variables:
state: State name (character).
y30: Histogram-valued distribution of treatment counts
as a weighted set string (e.g., "{0, 0.77; 1, 0.08; 2, 0.15}").
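A weighted set string in the form quoted above can be split into values and weights with base R string functions. The parser below is a hypothetical sketch written against the example string only; it is not a package function and assumes every entry follows the "value, weight" pattern separated by semicolons.

```r
# Example string copied from the format description above
s <- "{0, 0.77; 1, 0.08; 2, 0.15}"

# Strip braces, split entries on ";", then split each entry on ","
body    <- gsub("[{}]", "", s)
entries <- strsplit(strsplit(body, ";", fixed = TRUE)[[1]], ",", fixed = TRUE)
parsed  <- t(sapply(entries, function(e) as.numeric(trimws(e))))
colnames(parsed) <- c("value", "weight")

total_weight  <- sum(parsed[, "weight"])                      # 1
weighted_mean <- sum(parsed[, "value"] * parsed[, "weight"])  # 0.38
```

Checking that the weights sum to 1 is a quick sanity test before computing summaries from the parsed distribution.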
| Sample size (n) | 2 |
| Variables (p) | 2 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.20.
data(lung_cancer.hist)
Interval-valued dataset of 10 observations with pulse rate, systolic pressure, and diastolic pressure intervals.
data(lynne1.int)
A symbolic data frame (symbolic_tbl) with 10 observations
and 4 variables:
concept: Character concept label.
Pulse Rate: Interval-valued pulse rate (beats/min).
Systolic Pressure: Interval-valued systolic pressure (mmHg).
Diastolic Pressure: Interval-valued diastolic pressure (mmHg).
| Sample size (n) | 10 |
| Variables (p) | 4 |
| Subject area | Medical |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Regression |
RSDA R package (Lynne1 dataset).
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
Original data from the RSDA R package (Lynne1 dataset).
data(lynne1.int)
Weekly minimum and maximum values of the Argentine MERVAL stock market index from January 4, 2016 to September 28, 2020 (248 weeks). Daily data was downloaded and aggregated to weekly intervals. This dataset matches the period used by de Carvalho and Martos (2022).
data(merval.its)
A data frame with 248 observations and 3 variables:
date: Week start date, Monday (Date class).
low: Weekly minimum of daily low values.
high: Weekly maximum of daily high values.
The MERVAL (Mercado de Valores de Buenos Aires) is the main stock market index of the Buenos Aires Stock Exchange. Each observation represents one week, with the weekly low computed as the minimum of daily lows and the weekly high computed as the maximum of daily highs. The date column indicates the Monday (start) of each week. This period covers the Argentine economic crisis and the early COVID-19 pandemic impact.
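The weekly aggregation described above (minimum of daily lows, maximum of daily highs, keyed by the Monday of each week) can be sketched in base R. The prices below are made up, and this is not the script used to build the packaged data.

```r
# Two made-up trading weeks (Mon-Fri), starting Monday 2016-01-04
dates <- as.Date("2016-01-04") + c(0:4, 7:11)
low   <- c(10, 9, 11, 12, 10, 13, 12, 14, 15, 13)
high  <- low + 2

# Monday of each date's week: POSIXlt wday is 0 for Sunday, 1 for Monday
week_start <- dates - (as.POSIXlt(dates)$wday + 6) %% 7
week_key   <- format(week_start)

wk_low  <- tapply(low,  week_key, min)
wk_high <- tapply(high, week_key, max)
weekly <- data.frame(
  date = as.Date(names(wk_low)),
  low  = as.numeric(wk_low),
  high = as.numeric(wk_high)
)
# 2016-01-04: [9, 14]; 2016-01-11: [12, 17]
```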
| Sample size (n) | 248 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series (weekly aggregation) |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker ^MERV. Downloaded via the
quantmod package and aggregated from daily to weekly.
de Carvalho, F. A. T. and Martos, G. (2022). Modeling interval trendlines: Symbolic singular spectrum analysis for interval time series. Journal of Forecasting, 41(1), 167–180.
data(merval.its)
head(merval.its)
# ylim spans both bounds so the lower series is not clipped
plot(merval.its$date, merval.its$high, type = "l", col = "red",
     ylim = range(merval.its$low, merval.its$high, na.rm = TRUE),
     ylab = "Index Value", xlab = "Date",
     main = "MERVAL Weekly Min/Max (2016-2020)")
lines(merval.its$date, merval.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Convert MM format (paired _min/_max columns) to a
3-dimensional array [n, p, 2].
MM_to_ARRAY(data)
| data | A data.frame in MM format with paired _min/_max columns. |
A numeric array of dimension [n, p, 2] with dimnames.
Non-interval columns are excluded.
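To make the [n, p, 2] layout concrete, the following base R sketch builds such an array by hand from a small made-up MM-format data frame. It does not call `MM_to_ARRAY`, and the dimnames chosen here (`"min"`, `"max"`, and the `id` column as row labels) are assumptions that may differ from what the function returns.

```r
# Small made-up MM-format data frame with one non-interval column (id)
mm <- data.frame(
  id    = c("a", "b", "c"),
  x_min = c(1, 2, 3), x_max = c(2, 4, 6),
  y_min = c(0, 1, 2), y_max = c(5, 6, 7)
)

# Variables are identified by their _min columns; id has none and is skipped
vars <- sub("_min$", "", grep("_min$", names(mm), value = TRUE))

arr <- array(NA_real_,
             dim = c(nrow(mm), length(vars), 2),
             dimnames = list(as.character(mm$id), vars, c("min", "max")))
arr[, , "min"] <- as.matrix(mm[paste0(vars, "_min")])
arr[, , "max"] <- as.matrix(mm[paste0(vars, "_max")])

dim(arr)   # 3 2 2
```

The third dimension separates lower and upper bounds, so `arr[, , "max"] - arr[, , "min"]` gives an n-by-p matrix of interval widths.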
data(mushroom.int)
mm <- RSDA_to_MM(mushroom.int, RSDA = FALSE)
arr <- MM_to_ARRAY(mm)
dim(arr)
Convert an MM format data frame (paired _min/_max columns) to iGAP format.
MM_to_iGAP(data)
| data | The dataframe with the MM format. |
A data frame in iGAP format.
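The relationship between the two layouts can be illustrated by hand: each _min/_max pair collapses into one character column holding a comma-separated "min, max" string. The sketch below uses made-up data and base R only; it is not the implementation of `MM_to_iGAP`, and the exact separator and number formatting the function produces are assumptions.

```r
# Made-up MM-format data frame
mm <- data.frame(
  x_min = c(1, 2),   x_max = c(3, 5),
  y_min = c(0.5, 1), y_max = c(0.9, 2)
)

vars <- sub("_min$", "", grep("_min$", names(mm), value = TRUE))

# One character column per variable, formatted as "min, max"
igap <- data.frame(row.names = seq_len(nrow(mm)))
for (v in vars) {
  igap[[v]] <- paste(mm[[paste0(v, "_min")]], mm[[paste0(v, "_max")]], sep = ", ")
}
igap$x   # "1, 3" "2, 5"
```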
data(face.iGAP)
face <- iGAP_to_MM(face.iGAP, 1:6)
MM_to_iGAP(face)
Convert an interval data frame in MM format to RSDA format (symbolic_tbl).
MM_to_RSDA(data)
| data | The dataframe with the MM format (paired _min/_max columns). |
Return a symbolic_tbl dataframe with complex-encoded interval columns.
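The idea of a complex-encoded interval can be shown with base R complex numbers, which carry two reals per element. In this sketch the lower bound is stored in the real part and the upper bound in the imaginary part; that assignment is an assumption made for illustration, and the actual symbolic_tbl class adds structure that is not reproduced here.

```r
# Made-up interval bounds
lower <- c(1.5, 2.0)
upper <- c(3.5, 2.5)

# Pack each interval into one complex number (real = lower, imaginary = upper)
z <- complex(real = lower, imaginary = upper)

# Unpack the bounds again
back_lower <- Re(z)
back_upper <- Im(z)
width      <- back_upper - back_lower   # 2.0, 0.5
```

Storing both bounds in a single vector element keeps one column per interval variable, which is convenient in a data frame.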
data(mushroom.int)
mm <- RSDA_to_MM(mushroom.int, RSDA = FALSE)
rsda <- MM_to_RSDA(mm)
Mixed symbolic dataset of 5 car groups from the mtcars data,
with 7 interval-valued performance variables and 4 modal-valued
categorical variables.
data(mtcars.mix)
A symbolic data frame (symbolic_tbl) with 5 observations
(car groups) and 11 variables:
mpg: Interval-valued miles per gallon.
cyl: Modal-valued number of cylinders.
disp: Interval-valued displacement (cu.in.).
hp: Interval-valued horsepower.
drat: Interval-valued rear axle ratio.
wt: Interval-valued weight (1000 lbs).
qsec: Interval-valued quarter-mile time (seconds).
vs: Modal-valued engine type (V/S).
am: Modal-valued transmission type (auto/manual).
gear: Modal-valued number of forward gears.
carb: Modal-valued number of carburetors.
| Sample size (n) | 5 |
| Variables (p) | 11 |
| Subject area | Automotive |
| Symbolic format | Mixed (interval, modal) |
| Analytical tasks | Descriptive statistics, Clustering |
ggESDA R package (mtcars.i dataset).
Henderson, R. and Velleman, P. (1981). Building multiple regression models interactively. Biometrics, 37, 391–411.
Original data from the ggESDA R package (mtcars.i dataset).
data(mtcars.mix)
Extended mushroom data with fuzzy stipe thickness (Small/Average/Large), numerical stipe length, interval cap size, and categorical cap colour for two Amanita species (4 specimens).
data(mushroom_fuzzy.mix)
A data frame with 4 observations (Mushroom1–Mushroom4) and 9 variables:
specimen: Specimen identifier (character).
species: Species name (character).
stipe_thickness: Stipe thickness measurement (numeric, cm).
fuzzy_small: Fuzzy membership degree for Small (numeric, 0–1).
fuzzy_average: Fuzzy membership degree for Average (numeric, 0–1).
fuzzy_large: Fuzzy membership degree for Large (numeric, 0–1).
stipe_length: Stipe length (numeric, cm).
cap_size: Cap size as interval string (e.g., "24 +/- 1", character).
cap_colour: Cap colour (character).
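Because `cap_size` is stored as a "centre +/- half-width" string, it has to be parsed before it can be used as an interval. The helper below is a hypothetical sketch based only on the example string in the list above; it assumes every entry follows that exact pattern.

```r
# Hypothetical helper (not a package function): "24 +/- 1" -> c(23, 25)
parse_plus_minus <- function(s) {
  parts <- as.numeric(strsplit(s, "+/-", fixed = TRUE)[[1]])
  c(lower = parts[1] - parts[2], upper = parts[1] + parts[2])
}

cap <- parse_plus_minus("24 +/- 1")
cap   # lower = 23, upper = 25
```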
| Sample size (n) | 4 |
| Variables (p) | 9 |
| Subject area | Biology |
| Symbolic format | Fuzzy |
| Analytical tasks | Descriptive statistics |
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Tables 1.14-1.16.
data(mushroom_fuzzy.mix)
Interval-valued version of the mushroom dataset. See mushroom.int.mm.
data(mushroom.int)
A symbolic data frame (symbolic_tbl) with 23 observations and 5 variables:
Species: Mushroom species name (character).
Pileus.Cap.Width: Pileus cap width range (cm, interval).
Stipe.Length: Stipe length range (cm, interval).
Stipe.Thickness: Stipe thickness range (cm, interval).
Edibility: Edibility code (U = Unknown, Y = Yes, N = No, T = Toxic; character).
| Sample size (n) | 23 |
| Variables (p) | 5 |
| Subject area | Biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 3.2.
data(mushroom.int)
Interval-valued data for 23 mushroom species of the genus Agaricus with 3 morphological measurements from the Fungi of California Species.
data(mushroom.int.mm)
A data frame with 23 observations and 5 variables:
Species: Mushroom species name.
Pileus.Cap.Width: Pileus cap width range (cm).
Stipe.Length: Stipe length range (cm).
Stipe.Thickness: Stipe thickness range (cm).
Edibility: Edibility code (U/Y/N/T).
Classic SDA dataset used for descriptive statistics, histogram construction, and clustering of interval-valued data.
| Sample size (n) | 23 |
| Variables (p) | 5 |
| Subject area | Biology |
| Symbolic format | Interval |
| Analytical tasks | Clustering, Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 3.2.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 3.2.
data(mushroom.int.mm)
Interval-valued dataset with 142 units and four interval-valued variables from the nycflights13 package, aggregated by month and carrier.
data(nycflights.int)
A symbolic data frame (symbolic_tbl) with 142 observations and 5 variables:
X: Month-carrier identifier (character).
dep_delay: Departure delay range (minutes, interval).
arr_delay: Arrival delay range (minutes, interval).
air_time: Air time range (minutes, interval).
distance: Distance range (miles, interval).
| Sample size (n) | 142 |
| Variables (p) | 5 |
| Subject area | Transportation |
| Symbolic format | Interval |
| Analytical tasks | Regression, Descriptive statistics |
https://CRAN.R-project.org/package=MAINT.Data
Duarte Silva, A.P., Brito, P., Filzmoser, P. and Dias, J.G. (2021). MAINT.Data: Modelling and Analysing Interval Data in R. R Journal, 13(2).
data(nycflights.int)
Modal-valued dataset of 9 occupations with gender and salary distributions.
This is the wide (flat table) format; see occupations2.modal for the
modal-valued version.
data(occupations.modal)
A data frame with 9 observations and 11 columns:
Occupation: Occupation name (character).
Gender(M), Gender(F): Proportion male/female (2 bins).
Salary(1) through Salary(7): Salary distribution
across 7 ordered bins (proportions).
n: Sample size (integer).
| Sample size (n) | 9 |
| Variables (p) | 11 |
| Subject area | Sociology |
| Symbolic format | Modal |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(occupations.modal)
Modal-valued version of the occupation salaries dataset.
See occupations.modal for the wide-format version.
data(occupations2.modal)
A symbolic data frame (symbolic_tbl) with 9 observations and 4 variables:
Occupation: Occupation name (character).
Gender: Modal distribution over gender (Male, Female).
Salary: Modal distribution over 7 ordered salary bins.
n: Sample size (numeric).
| Sample size (n) | 9 |
| Variables (p) | 4 |
| Subject area | Sociology |
| Symbolic format | Modal |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(occupations2.modal)
Interval-valued dataset of 30-year trimmed mean daily temperatures for the Ohio river basin. Intervals are defined by the mean daily maximum and minimum temperatures from January 1, 1988 to December 31, 2018.
data(ohtemp.int)
A data frame with 161 rows and 7 variables:
ID: Global Historical Climatological Network (GHCN) station identifier.
NAME: GHCN station name.
STATE: Two-digit state designation.
LATITUDE: Latitude coordinate position.
LONGITUDE: Longitude coordinate position.
ELEVATION: Elevation of the measurement location (meters).
TEMPERATURE: 30-year mean daily temperature (tenths of degrees Celsius).
| Sample size (n) | 161 |
| Variables (p) | 7 |
| Subject area | Climate |
| Symbolic format | Interval |
| Analytical tasks | Regression, Spatial analysis |
https://CRAN.R-project.org/package=intkrige
data(ohtemp.int)
Classic benchmark interval-valued data for 8 oils and fats described by 4 physico-chemical properties. Originally from Ichino (1988).
data(oils.int)
A data frame with 8 observations and 9 columns (4 interval variables
in _l/_u Min-Max pairs, plus a label):
sample: Oil/fat sample name (character).
specific_gravity_l, specific_gravity_u: Specific gravity range.
freezing_point_l, freezing_point_u: Freezing point range (degrees Celsius).
iodine_value_l, iodine_value_u: Iodine value range.
saponification_value_l, saponification_value_u: Saponification value range.
The 8 samples are: Linseed oil, Perilla oil, Cottonseed oil, Sesame oil, Camellia oil, Olive oil, Beef tallow, Hog fat. The expected 3-cluster structure is: {Beef tallow, Hog fat}, {Cottonseed, Sesame, Camellia, Olive}, and {Linseed, Perilla}. Widely used for comparing clustering methods and distance measures in symbolic data analysis.
| Sample size (n) | 8 |
| Variables (p) | 9 |
| Subject area | Chemistry |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Ichino, M. (1988). General metrics for mixed features. Proc. IEEE Conf. Systems, Man, and Cybernetics, pp. 494-497.
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 13.7, p.253.
data(oils.int)
Histogram-valued dataset of 84 daily observations with 4 weather-related histogram variables. Each histogram has 10 equal-probability (decile) bins summarizing hourly measurements within each day.
data(ozone.hist)
A data frame with 84 observations (days) and 4 histogram-valued variables:
Ozone.Conc.ppb: Histogram of ozone concentration (ppb).
Temperature.C: Histogram of temperature (Celsius).
Solar.Radiation.WattM2: Histogram of solar radiation (W/m^2).
Wind.Speed.mSec: Histogram of wind speed (m/s).
Row names are I1 through I84.
| Sample size (n) | 84 |
| Variables (p) | 4 |
| Subject area | Environment |
| Symbolic format | Histogram |
| Analytical tasks | Regression, Clustering |
HistDAWass R package (OzoneH dataset).
Irpino, A. and Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Original data from the HistDAWass R package (OzoneH dataset),
reduced from 100 quantile bins to 10 decile bins.
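The bin reduction mentioned above can be illustrated with a minimal base R sketch. It assumes a histogram with equal-probability bins is stored as a vector of 101 break points (the 0th through 100th percentiles). The data are simulated and the helper name `reduce_to_deciles` is hypothetical; this is not the code used to build `ozone.hist`.

```r
# Simulated hourly ozone readings for one day (illustrative values only)
set.seed(1)
hourly <- runif(24, min = 10, max = 80)

# 101 break points: the 0th, 1st, ..., 100th percentiles (100 equal-probability bins)
breaks100 <- quantile(hourly, probs = seq(0, 1, by = 0.01), names = FALSE)

# Hypothetical helper: keep every 10th break to obtain 10 decile bins
reduce_to_deciles <- function(breaks) {
  stopifnot(length(breaks) == 101)
  breaks[seq(1, 101, by = 10)]
}

breaks10 <- reduce_to_deciles(breaks100)
probs10  <- rep(0.1, 10)   # each decile bin carries probability 0.1

length(breaks10)           # 11 break points define 10 bins
```

Because every tenth percentile is itself a decile, no re-estimation from the raw data is needed under this assumption.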
data(ozone.hist)
Daily high and low stock prices of Petrobras (ADR traded on NYSE) from January 3, 2005 to December 29, 2006 (503 trading days). This dataset matches the period used by Maia, de Carvalho and Ludermir (2008) in their work on forecasting models for interval-valued time series.
data(petrobras.its)
A data frame with 503 observations and 3 variables:
date: Trading date (Date class).
low: Daily low price (USD).
high: Daily high price (USD).
Petrobras (Petroleo Brasileiro S.A.) is the Brazilian multinational petroleum corporation. The ADR (American Depositary Receipt) is traded on the New York Stock Exchange under ticker PBR. Each observation represents a trading day with the daily low and high prices forming an interval. This was one of the first datasets used to demonstrate interval-valued autoregressive (iAR) models.
| Sample size (n) | 503 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker PBR. Downloaded via the
quantmod package.
Maia, A. L. S., de Carvalho, F. A. T. and Ludermir, T. B. (2008). Forecasting models for interval-valued time series. Neurocomputing, 71(16–18), 3344–3352.
data(petrobras.its)
head(petrobras.its)
# Set ylim from both series so the low line is not clipped
plot(petrobras.its$date, petrobras.its$high, type = "l", col = "red", ylim = range(petrobras.its$low, petrobras.its$high), ylab = "Price (USD)", xlab = "Date", main = "Petrobras Daily High/Low (2005-2006)")
lines(petrobras.its$date, petrobras.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Mixed symbolic dataset of 30 car models sold in Poland, with 9 interval-valued technical specification variables and 3 multinomial-valued categorical variables.
data(polish_cars.mix)
A symbolic data frame (symbolic_tbl) with 30 observations
and 12 variables:
price: Interval-valued price (PLN).
body: Multinomial body types (e.g., hatchback, sedan, combi).
wheelbase: Interval-valued wheelbase (mm).
chassis_length: Interval-valued chassis length (mm).
chassis_width: Interval-valued chassis width (mm).
chassis_height: Interval-valued chassis height (mm).
engine_capacity: Multinomial engine displacement categories (litres).
engine_power: Interval-valued engine power (HP).
maximum_speed: Interval-valued maximum speed (km/h).
acceleration: Interval-valued 0–100 km/h time (seconds).
fuel_type: Multinomial fuel types (petrol, diesel, LPG).
fuel_consumption: Interval-valued fuel consumption (L/100km).
| Sample size (n) | 30 |
| Variables (p) | 12 |
| Subject area | Automotive |
| Symbolic format | Mixed (interval, multinomial) |
| Analytical tasks | Clustering, Descriptive statistics |
symbolicDA R package (cars dataset).
Dudek, A. and Pelka, M. (2012). symbolicDA: Analysis of Symbolic Data. R package.
data(polish_cars.mix)
Interval-valued dataset of 18 Polish voivodships (administrative regions) with 9 socio-economic interval variables describing demographic and economic characteristics at the county (powiat) level.
data(polish_voivodships.int)
A symbolic data frame (symbolic_tbl) with 18 observations
(voivodships) and 9 interval-valued variables:
V1 through V9: Interval-valued socio-economic
indicators aggregated across counties within each voivodship.
Row names are voivodship names (e.g., Dolnoslaskie, Lubelskie).
| Sample size (n) | 18 |
| Variables (p) | 9 |
| Subject area | Socioeconomics |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
clusterSim R package (data_pathtinger dataset).
Dudek, A. and Pelka, M. (2022). symbolicDA: Analysis of Symbolic Data. R package.
Walesiak, M. and Dudek, A. (2020). clusterSim: Searching for Optimal Clustering Procedure for a Data Set. R package.
data(polish_voivodships.int)
Interval-valued data for 15 profession entries classified by work type (White Collar / Blue Collar). Each entry describes a specific profession with salary and working duration ranges.
data(profession.int)
A symbolic data frame (symbolic_tbl) with 15 observations and 4 variables:
Type_of_Work: Work category (White Collar or Blue Collar, character).
Profession: Profession name (character).
Salary: Salary range (currency units, interval).
Duration: Working duration range (hours per week, interval).
| Sample size (n) | 15 |
| Variables (p) | 4 |
| Subject area | Sociology |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Classification |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(profession.int)
Interval-valued clinical measurements for 97 prostate cancer patients (training and test sets combined). The 9 interval-valued variables cover log cancer volume, log prostate weight, age, and other clinical predictors.
data(prostate.int)
A data frame with 97 observations and 9 interval-valued variables:
lcavol: Log cancer volume range.
lweight: Log prostate weight range.
age: Patient age range.
lbph: Log benign prostatic hyperplasia amount range.
svi: Seminal vesicle invasion range.
lcp: Log capsular penetration range.
gleason: Gleason score range.
pgg45: Percentage Gleason scores 4 or 5 range.
lpsa: Log prostate specific antigen range.
| Sample size (n) | 97 |
| Variables (p) | 9 |
| Subject area | Medical |
| Symbolic format | Interval |
| Analytical tasks | Regression |
Extracted from RSDA package (int_prost_train, int_prost_test).
Stamey, T. et al. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J. Urology, 141(5), 1076-1083.
data(prostate.int)
Reads an external CSV file containing symbolic data, automatically detects whether the data is interval-valued (min/max pairs or comma-separated), histogram-valued, modal-valued, or another symbolic type, and returns an appropriate R object.
read_symbolic_csv(file, sep = ",", header = TRUE, row.names = NULL, stringsAsFactors = FALSE, na.strings = c("", "NA"), symbolic_type = NULL, ...)
| file | Path to the CSV file to read. |
| sep | Field separator character. Default ",". |
| header | Logical; does the first row contain column names? Default TRUE. |
| row.names | Column number or character string giving row names, passed on to the underlying reader. Default NULL. |
| stringsAsFactors | Logical; should character columns be converted to factors? Default FALSE. |
| na.strings | Character vector of strings to interpret as NA. Default c("", "NA"). |
| symbolic_type | Optional character string to override automatic type detection. The default, NULL, leaves detection automatic. |
| ... | Additional arguments passed to the underlying reader. |
The detection heuristic works as follows:
Interval (MM): If the file contains paired
_min/_max columns the data is returned as-is (MM format).
Interval (iGAP): If one or more character columns contain
comma-separated numeric pairs (e.g., "1.2,3.4") they are
expanded into _min/_max column pairs and the result is
returned in MM format.
Histogram / Modal: If columns follow a VarName(bin)
naming pattern (e.g., Crime(violent)) and the proportions within
each variable group sum to approximately 1, the data is classified as
histogram or modal. It is returned as a plain data.frame.
Other: If none of the above patterns match, the data is
returned as a plain data.frame.
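The iGAP branch of the heuristic (expanding "min,max" strings into `_min`/`_max` pairs) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper `expand_igap_column` is hypothetical and the data values are made up.

```r
# Toy iGAP-style data: each interval is a "min,max" string (illustrative values)
igap <- data.frame(
  Length = c("0.28,0.66", "0.30,0.74"),
  Height = c("0.09,0.18", "0.10,0.23"),
  row.names = c("F-4-6", "M-4-6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one "min,max" column into two numeric columns
expand_igap_column <- function(x, name) {
  parts <- strsplit(x, ",", fixed = TRUE)
  stopifnot(all(lengths(parts) == 2))
  out <- data.frame(
    as.numeric(vapply(parts, `[`, character(1), 1)),
    as.numeric(vapply(parts, `[`, character(1), 2))
  )
  names(out) <- paste0(name, c("_min", "_max"))
  out
}

# Expand every column and bind the pairs side by side (MM layout).
# unname() stops cbind() from prefixing the column names a second time.
mm <- do.call(cbind, unname(Map(expand_igap_column, igap, names(igap))))
rownames(mm) <- rownames(igap)
names(mm)
```

The result has one numeric `_min` and one numeric `_max` column per original interval column, which is the MM layout described above.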
A data.frame. Interval data is returned in MM format
(paired _min/_max columns). All other symbolic types are
returned as plain data frames.
write_symbolic_csv, int_detect_format,
int_convert_format
# Write then read back an interval dataset
data(mushroom.int.mm)
tmp <- tempfile(fileext = ".csv")
write_symbolic_csv(mushroom.int.mm, tmp)
df <- read_symbolic_csv(tmp)
head(df)
# Write then read back a histogram dataset
data(airline_flights.hist)
tmp2 <- tempfile(fileext = ".csv")
write_symbolic_csv(airline_flights.hist, tmp2)
df2 <- read_symbolic_csv(tmp2)
head(df2)
This function changes the format of the data to conform to RSDA format.
RSDA_format(data, sym_type1 = NULL, location = NULL, sym_type2 = NULL, var = NULL)
| data | A conventional data frame. |
| sym_type1 | Type labels: I denotes an interval variable and S a set variable (written $I and $S in RSDA files). |
| location | Positions of the sym_type labels in the data. |
| sym_type2 | Type labels: I denotes an interval variable and S a set variable (written $I and $S in RSDA files). |
| var | Names of the symbolic variables in the data. |
Returns a data frame in which a type label is added in the column preceding each symbolic variable.
data("mushroom.int.mm")
mushroom.set <- set_variable_format(data = mushroom.int.mm, location = 8, var = "Species")
mushroom.tmp <- RSDA_format(data = mushroom.set, sym_type1 = c("I", "S"), location = c(25, 31), sym_type2 = c("S", "I", "I"), var = c("Species", "Stipe.Length_min", "Stipe.Thickness_min"))
Convert RSDA format (symbolic_tbl) to a 3-dimensional array
[n, p, 2] where slice [,,1] contains the minima and
slice [,,2] contains the maxima.
RSDA_to_ARRAY(data)
| data | A symbolic_tbl with interval columns. |
A numeric array of dimension [n, p, 2] with dimnames.
Only interval (symbolic_interval) columns are included.
data(mushroom.int)
arr <- RSDA_to_ARRAY(mushroom.int)
dim(arr) # [23, 3, 2]
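To make the [n, p, 2] layout concrete, here is a minimal base R sketch that builds such an array by hand. It starts from a toy MM-style data frame rather than a `symbolic_tbl`, so it does not depend on RSDA; the values and variable names are illustrative assumptions, not package data.

```r
# Toy MM-format data frame (illustrative values, not from the package)
mm <- data.frame(
  Height_min = c(10, 12), Height_max = c(15, 20),
  Weight_min = c(1.5, 2.0), Weight_max = c(2.5, 3.5),
  row.names = c("obs1", "obs2")
)

min_cols <- grep("_min$", names(mm), value = TRUE)
vars     <- sub("_min$", "", min_cols)
max_cols <- paste0(vars, "_max")

# array() fills column-major, so the minima matrix becomes slice [,,1]
# and the maxima matrix becomes slice [,,2]
arr <- array(
  c(as.matrix(mm[min_cols]), as.matrix(mm[max_cols])),
  dim = c(nrow(mm), length(vars), 2),
  dimnames = list(rownames(mm), vars, c("min", "max"))
)

dim(arr)
arr["obs2", "Weight", ]   # lower and upper bound of one interval
```

Keeping minima and maxima in separate slices lets bound-wise operations such as `arr[, , 2] - arr[, , 1]` (interval widths) run as plain matrix arithmetic.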
Converts an RSDA-format interval data frame to iGAP format.
RSDA_to_iGAP(data)
| data | An interval data frame in RSDA format. |
Returns a data frame in iGAP format.
data(mushroom.int)
RSDA_to_iGAP(mushroom.int)
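The iGAP representation itself is simple to picture. The sketch below, an assumption-laden illustration rather than the package's code, builds iGAP-style "min,max" strings from a toy MM-style data frame in base R (the package function instead takes RSDA input).

```r
# Toy MM-format data frame (illustrative values)
mm <- data.frame(
  Height_min = c(10, 12), Height_max = c(15, 20),
  row.names = c("obs1", "obs2")
)
vars <- sub("_min$", "", grep("_min$", names(mm), value = TRUE))

# One character column per interval variable, formatted as "min,max"
igap <- data.frame(row.names = rownames(mm))
for (v in vars) {
  igap[[v]] <- paste(mm[[paste0(v, "_min")]], mm[[paste0(v, "_max")]], sep = ",")
}
igap
```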
Converts an RSDA-format interval data frame to MM format.
RSDA_to_MM(data, RSDA = TRUE)
| data | An interval data frame in RSDA format. |
| RSDA | Logical; whether to load the RSDA package. Default TRUE. |
Returns a data frame in MM format.
data(mushroom.int)
RSDA_to_MM(mushroom.int, RSDA = FALSE)
Search and filter the dataSDA dataset catalog by metadata criteria including sample size, number of variables, subject area, symbolic format, analytical tasks, keywords, and book reference.
search_data(...)
| ... | Filter expressions. Each argument is a comparison expression evaluated against the dataset metadata. Supported columns: n, p, subject, type, task, tag, book. |
For character columns (subject, type, task, tag,
book), the == operator performs a case-insensitive substring
match (using grepl). The type column uses short suffix-based
labels that match the dataset name suffix (e.g., type == "int"
matches all .int datasets).
For numeric columns (n, p), standard comparison operators
are used with exact semantics.
When no arguments are provided, or when tag == "all" is used,
all datasets are returned.
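The matching rules above can be mimicked on a toy catalog. This is a sketch under stated assumptions: the catalog rows are invented, `match_text` is a hypothetical helper, and the real function may implement its `grepl` call differently.

```r
# Toy catalog (hypothetical rows; not the package's real metadata)
catalog <- data.frame(
  name = c("a.int", "b.hist", "c.int"),
  n    = c(24, 84, 8),
  task = c("Clustering, Visualization", "Regression, Clustering", "Clustering"),
  stringsAsFactors = FALSE
)

# Case-insensitive substring match for character columns
match_text <- function(column, pattern) {
  grepl(tolower(pattern), tolower(column), fixed = TRUE)
}

# Equivalent in spirit to: search_data(task == "regression", n > 10)
hits <- catalog[match_text(catalog$task, "regression") & catalog$n > 10, ]
hits$name
```

Because the text match is a substring match, a filter such as `task == "Clustering"` also selects datasets whose task field lists several tasks.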
A data frame with one row per matching dataset and the following
columns: name, n, p, subject, type,
task, tag, book.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley.
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley.
# List all datasets
search_data()
# Filter by symbolic format (suffix-based)
search_data(type == "hist")
# Filter by analytical task and size
search_data(task == "Regression", n > 10)
# Filter by book reference
search_data(book == "SDA_2006")
# Combine multiple filters
search_data(type == "int", task == "Clustering", subject == "Biology")
# Filter by size range
search_data(n >= 20, n <= 100, p < 10)
This function changes the format of the set variables in the data to conform to the RSDA format.
set_variable_format(data, location = NULL, var = NULL)
| data | A conventional data frame. |
| location | The location of the set variable in the data. |
| var | The name of the set variable in the data. |
Returns a data frame in which the set variable is converted to a one-hot encoding.
data("mushroom.int.mm")
mushroom.set <- set_variable_format(data = mushroom.int.mm, location = 8, var = "Species")
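For readers unfamiliar with one-hot encoding, the idea can be sketched in base R on toy data. The column naming (`Species_<level>`) and the output layout are assumptions for illustration; what `set_variable_format` actually produces may differ.

```r
# Toy data with one set-valued (categorical) column; values are illustrative
df <- data.frame(
  Species = c("arorae", "arvenis", "arorae"),
  Cap_min = c(3, 6, 4),
  stringsAsFactors = FALSE
)

levels_found <- sort(unique(df$Species))

# One indicator column per category: 1 if the row has that category, else 0
one_hot <- sapply(levels_found, function(lv) as.integer(df$Species == lv))
colnames(one_hot) <- paste0("Species_", levels_found)   # hypothetical naming

encoded <- cbind(df["Cap_min"], one_hot)
encoded
```

Each row has exactly one indicator set to 1, so the original category can always be recovered from the encoded columns.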
Daily high and low values of the Shanghai Stock Exchange Composite Index (SSE Composite) from January 2, 2019 to December 30, 2022 (970 trading days). This dataset matches the period used by Yang, Zhang and Wang (2025) for interval time series forecasting.
data(shanghai_stock.its)
A data frame with 970 observations and 3 variables:
date: Trading date (Date class).
low: Daily low value of the SSE Composite Index.
high: Daily high value of the SSE Composite Index.
The SSE Composite Index is the most commonly used indicator to reflect the performance of the Shanghai Stock Exchange. It tracks all stocks (A-shares and B-shares) listed on the exchange. This dataset covers a period that includes the COVID-19 pandemic and its market impacts, providing a rich testbed for evaluating interval forecasting models under extreme volatility.
| Sample size (n) | 970 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker 000001.SS. Downloaded via the
quantmod package.
Yang, W., Zhang, S. and Wang, S. (2025). On smooth transition interval autoregressive models. Journal of Forecasting, 44(2), 310–332.
data(shanghai_stock.its)
head(shanghai_stock.its)
# Set ylim from both series so the low line is not clipped
plot(shanghai_stock.its$date, shanghai_stock.its$high, type = "l", col = "red", ylim = range(shanghai_stock.its$low, shanghai_stock.its$high), ylab = "Index Value", xlab = "Date", main = "Shanghai Composite Daily High/Low (2019-2022)")
lines(shanghai_stock.its$date, shanghai_stock.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Small simulated histogram-valued dataset of 5 observations with 2 histogram-valued variables. Useful for testing and demonstrating histogram-valued statistical methods.
data(simulated.hist)
A data frame with 5 observations and 2 histogram-valued variables:
Y1: Histogram-valued variable 1.
Y2: Histogram-valued variable 2.
Row names are Obs_1 through Obs_5.
| Sample size (n) | 5 |
| Variables (p) | 2 |
| Subject area | Methodology |
| Symbolic format | Histogram |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 7-26.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 7-26.
data(simulated.hist)
Interval-valued data for 20 teams from the French premier soccer championship. Contains ranges of Weight (response), Height and Age (explanatory variables).
data(soccer_bivar.int)
A data frame with 20 rows and 3 interval-valued variables:
y: Weight (response variable, kg).
t1: Height (explanatory variable, cm).
t2: Age (explanatory variable, years).
| Sample size (n) | 20 |
| Variables (p) | 3 |
| Subject area | Sports |
| Symbolic format | Interval |
| Analytical tasks | Regression |
https://CRAN.R-project.org/package=iRegression
Lima Neto, E. A., Cordeiro, G. and De Carvalho, F.A.T. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation, 81, 1727-1744.
data(soccer_bivar.int)
Convert SODAS format (XML file) to a 3-dimensional array
[n, p, 2].
SODAS_to_ARRAY(XMLPath)
| XMLPath | Disk path where the SODAS *.XML file is. |
A numeric array of dimension [n, p, 2] with dimnames.
## Not run:
arr <- SODAS_to_ARRAY("C:/Users/user/AppData/abalone.xml")
## End(Not run)
Converts a SODAS-format (XML) interval data set to iGAP format.
SODAS_to_iGAP(XMLPath)
| XMLPath | Disk path where the SODAS *.XML file is. |
Returns a data frame in iGAP format.
## Not run:
# Read from a SODAS XML file:
abalone <- SODAS_to_iGAP("C:/Users/user/AppData/abalone.xml")
## End(Not run)
Converts a SODAS-format (XML) interval data set to MM format.
SODAS_to_MM(XMLPath)
| XMLPath | Disk path where the SODAS *.XML file is. |
Returns a data frame in MM format.
## Not run:
# Read from a SODAS XML file:
abalone <- SODAS_to_MM("C:/Users/user/AppData/abalone.xml")
## End(Not run)
Daily high and low prices of the S&P 500 index from January 2, 2004 to December 30, 2005 (504 trading days). This dataset is a benchmark for interval time series forecasting, matching the period used in the foundational work by Arroyo, Gonzalez-Rivera and Mate (2011).
data(sp500.its)
A data frame with 504 observations and 3 variables:
date: Trading date (Date class).
low: Daily low price of the S&P 500 index.
high: Daily high price of the S&P 500 index.
The S&P 500 is a market-capitalization-weighted index of 500 leading publicly traded companies in the United States. Each observation represents a trading day with the daily low and high prices forming an interval. This dataset has been widely used to evaluate interval-valued autoregressive models, exponential smoothing methods for intervals, and center-and-range forecasting approaches.
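The center-and-range view of an interval time series mentioned above can be sketched in base R. The data frame below only mimics the column layout of `sp500.its`; the values are illustrative, and the column names `center` and `half_range` are assumptions introduced here.

```r
# Toy interval time series with the same columns as sp500.its (values illustrative)
its <- data.frame(
  date = as.Date("2004-01-02") + 0:3,
  low  = c(1105.1, 1108.5, 1116.0, 1118.4),
  high = c(1118.9, 1122.2, 1124.5, 1126.3)
)

# Center (midpoint) and half-range of each daily interval
its$center     <- (its$low + its$high) / 2
its$half_range <- (its$high - its$low) / 2

# The interval is recovered as center -/+ half-range (up to floating-point error)
all.equal(its$center - its$half_range, its$low)
all.equal(its$center + its$half_range, its$high)
```

Center-and-range approaches then model the two derived series separately and rebuild interval forecasts from the predicted center and half-range.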
| Sample size (n) | 504 |
| Variables (p) | 3 (date, low, high) |
| Subject area | Finance |
| Symbolic format | Interval time series |
| Analytical tasks | Forecasting, Time series analysis |
Yahoo Finance, ticker ^GSPC. Downloaded via the
quantmod package.
Arroyo, J., Gonzalez-Rivera, G. and Mate, C. (2011). Forecasting with interval and histogram data: Some financial applications. In Handbook of Empirical Economics and Finance, pp. 247–280. Chapman and Hall/CRC.
data(sp500.its)
head(sp500.its)
# Set ylim from both series so the low line is not clipped
plot(sp500.its$date, sp500.its$high, type = "l", col = "red", ylim = range(sp500.its$low, sp500.its$high), ylab = "Price", xlab = "Date", main = "S&P 500 Daily High/Low")
lines(sp500.its$date, sp500.its$low, col = "blue")
legend("topleft", c("High", "Low"), col = c("red", "blue"), lty = 1)
Histogram-valued dataset of 6 US states with 4 income distribution histograms. Each histogram describes the distribution of household income within a state.
data(state_income.hist)
A data frame with 6 observations (states) and 4 histogram-valued variables:
Y1 through Y4: Histogram-valued income distribution
variables.
Row names are State_1 through State_6.
| Sample size (n) | 6 |
| Variables (p) | 4 |
| Subject area | Economics |
| Symbolic format | Histogram |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2020), Table 7-18.
Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley, Chichester. Table 7-18.
data(state_income.hist)
Synthetic interval-valued dataset with 125 observations in 5 groups of 25 each, described by 6 interval-valued variables and a cluster label. Designed for benchmarking interval data clustering algorithms.
data(synthetic_clusters.int)
A symbolic data frame (symbolic_tbl) with 125 observations and 7 variables:
V1 through V6: Six interval-valued variables.
class: Cluster membership (1–5, set-valued).
| Sample size (n) | 125 |
| Variables (p) | 7 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Extracted from symbolicDA package (data_symbolic).
Dudek, A. and Pelka, M. (2022). symbolicDA: Analysis of Symbolic Data. R package.
data(synthetic_clusters.int)
Interval-valued data for 5 teams in a local pickup league, classified by season performance. Each team is described by ranges of player age, weight, and speed.
data(teams.int)
A data frame with 5 observations and 7 columns (3 interval variables
in _l/_u Min-Max pairs, plus a label):
team_type: Performance category (Very Good, Good, Average, Fair, Poor).
age_l, age_u: Player age range (years).
weight_l, weight_u: Player weight range (pounds).
speed_l, speed_u: Speed range – time to run 100 yards (seconds).
The symbolic results are more informative than classical midpoint analyses: the Very Good team has homogeneous players, whereas the Poor team has players varying widely in age, weight, and speed. Used for symbolic principal component analysis.
| Sample size (n) | 5 |
| Variables (p) | 7 |
| Subject area | Sports |
| Symbolic format | Interval |
| Analytical tasks | PCA |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.24, p.63.
data(teams.int)
Interval-valued monthly temperatures for major cities worldwide. Benchmark dataset for comparing distance measures (Hausdorff, L2, Wasserstein) in dynamic clustering algorithms.
data(temperature_city.int)
A data frame with 6 observations and 13 columns (6 monthly interval
variables in _l/_u Min-Max pairs, plus a label). Only
January through June are included:
city: City name (character).
jan_l, jan_u: January temperature range (degrees Celsius).
feb_l, feb_u: February temperature range.
mar_l, mar_u: March temperature range.
apr_l, apr_u: April temperature range.
may_l, may_u: May temperature range.
jun_l, jun_u: June temperature range.
Expert partition into 4 classes: Class 1 (tropical/warm), Class 2 (temperate European and Asian), Class 3 (Mauritius), Class 4 (Tehran).
| Sample size (n) | 6 |
| Variables (p) | 13 |
| Subject area | Climate |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Verde, R. and Irpino, A. (2008). A new interval data distance based on the Wasserstein metric. Proc. COMPSTAT 2008, pp. 705-712.
data(temperature_city.int)
Interval-valued data for tennis players aggregated by court type (Hard, Grass, Indoor, Clay) with weight, height, and racket tension.
data(tennis.int)
A data frame with 4 observations and 7 columns (3 interval variables
in _l/_u Min-Max pairs, plus a label):
court_type: Type of court (Hard, Grass, Indoor, Clay).
player_weight_l, player_weight_u: Player weight range (kg).
player_height_l, player_height_u: Player height range (m).
racket_tension_l, racket_tension_u: Racket tension range.
Clustering on weight and height separates grass courts from the rest (decision rule: Weight <= 74.75 kg). When all three variables are used, clustering separates by racket tension instead.
| Sample size (n) | 4 |
| Variables (p) | 7 |
| Subject area | Sports |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.25, p.64.
data(tennis.int)
Convert interval data from any recognized format to all six supported interval data formats and return the results as a named list. This is useful for inspecting and comparing how the same interval data is represented across different formats.
to_all_interval_formats(x, ...)
| x | Interval data in one of the supported formats: RSDA, MM, iGAP, ARRAY, or a SODAS/SDS XML file path. |
| ... | Additional arguments passed to the conversion functions. |
Six interval data formats are supported in this package. Each format stores the same information – lower and upper bounds for every variable of every observation – but differs in its structure and origin:
RSDA: A symbolic_tbl object (class
c("symbolic_tbl", "tbl_df", "tbl", "data.frame")) where each
interval variable is a complex column (symbolic_interval):
Re() gives the minimum and Im() gives the maximum.
This is the native format of the RSDA package
(Billard & Diday, 2006; Rodriguez, 2024).
MM: A plain data.frame where each interval variable is represented by
two numeric columns named <var>_min and <var>_max.
This is a widely used general-purpose representation.
iGAP: A data.frame where each interval variable is stored as a character
column with comma-separated values "min,max".
This is the format used by the iGAP software (Correia, 2009).
ARRAY: A three-dimensional numeric array of size [n, p, 2].
The first slice [,,1] contains all minima and the second slice
[,,2] contains all maxima. Dimnames encode observation labels,
variable names, and c("min", "max"). This format is convenient
for matrix-based computations.
SODAS: An XML file on disk produced by the SODAS software
(Diday & Noirhomme-Fraiture, 2008). In R, SODAS data is referenced by
its file path and read via RSDA::SODAS.to.RSDA(). Since SODAS is a
file-based format, it cannot be generated from in-memory data.
SDS: An alias for SODAS. Both refer to the same XML-based format.
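The complex-number encoding described for the RSDA format can be illustrated with base R alone. This sketch shows only the underlying idea (real part = minimum, imaginary part = maximum); it does not construct an actual `symbolic_interval` column, and the values are illustrative.

```r
# Lower and upper bounds for one interval variable (illustrative values)
lower <- c(1.2, 0.5, 3.0)
upper <- c(3.4, 0.9, 3.0)

# Pack each interval into a complex number: real part = min, imaginary part = max
z <- complex(real = lower, imaginary = upper)

Re(z)          # minima
Im(z)          # maxima
Im(z) - Re(z)  # interval widths
```

Storing both bounds in a single vector keeps one column per variable, which is why the RSDA format has p columns where the MM format has 2p.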
A named list with six slots:
RSDA: A symbolic_tbl with complex-encoded symbolic_interval columns.
MM: A data.frame with paired _min/_max columns.
iGAP: A data.frame with comma-separated "min,max" character values.
ARRAY: A three-dimensional numeric array of dimension [n, p, 2] where [,,1] stores minima and [,,2] stores maxima.
SODAS: NULL unless the input is a SODAS XML file path, in which case it stores the original path.
SDS: NULL unless the input is a SODAS/SDS XML file path (alias for SODAS).
Han-Ming Wu
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
Rodriguez, O. (2024). RSDA: R to Symbolic Data Analysis. R package, https://CRAN.R-project.org/package=RSDA.
Correia, M. (2009). Interval GARCH and Aggregation of Predictions.
Diday, E. and Noirhomme-Fraiture, M. (2008). Symbolic Data Analysis and the SODAS Software. Wiley.
int_detect_format, int_convert_format,
int_list_conversions
data(car.int)
result <- to_all_interval_formats(car.int)
names(result)
# RSDA format (symbolic_tbl)
result$RSDA
# MM format (data.frame with _min/_max columns)
head(result$MM)
# iGAP format (data.frame with comma-separated values)
head(result$iGAP)
# ARRAY format (3D array)
dim(result$ARRAY)
result$ARRAY[1:3, , 1] # minima
result$ARRAY[1:3, , 2] # maxima
# SODAS/SDS slots are NULL (file-based format)
result$SODAS
result$SDS
Symbolic data for 3 towns (Paris, Lyon, Toulouse) combining school and hospital databases. Contains interval-valued, multi-valued, and modal-valued variables.
data(town_services.mix)
A data frame with 3 observations (Paris, Lyon, Toulouse) and 8 columns:
town: Town name (character).
no_pupils_l, no_pupils_u: Number of pupils range (Min-Max pair).
type: School type (modal, character).
level: Coded level (multi-valued, character).
no_beds_l, no_beds_u: Number of beds range (Min-Max pair).
specialty: Specialty code (multi-valued, character).
| Sample size (n) | 3 |
| Variables (p) | 8 |
| Subject area | Public services |
| Symbolic format | Mixed (interval, modal, multi-valued) |
| Analytical tasks | Descriptive statistics |
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.21, p.19.
data(town_services.mix)
Simple 5x3 example illustrating different interval types: full intervals (hyperrectangles), degenerate intervals (lines), and trivial intervals (points). Used for vertices PCA demonstration.
data(trivial_intervals.int)
A data frame with 5 observations (w1–w5) and 6 columns (3 interval
variables in _l/_u Min-Max pairs):
y1_l, y1_u: First interval variable.
y2_l, y2_u: Second interval variable.
y3_l, y3_u: Third interval variable.
| Sample size (n) | 5 |
| Variables (p) | 6 |
| Subject area | Methodology |
| Symbolic format | Interval |
| Analytical tasks | PCA |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 5.1, p.146.
data(trivial_intervals.int)
Interval-valued crime statistics for 46 US states, containing 102 interval-valued variables covering various crime types and rates. Originally from the RSDA package.
data(uscrime.int)
A symbolic data frame (symbolic_tbl) with 46 observations and
102 interval-valued variables. Key variables include:
fold: Cross-validation fold assignment.
population: Population range.
householdsize: Household size range.
racepctblack, racePctWhite, racePctAsian,
racePctHisp: Race percentage ranges.
medIncome, medFamInc, perCapInc: Income ranges.
PctUnemployed, PctEmploy: Employment percentage ranges.
ViolentCrimesPerPop: Violent crimes per population range.
Plus the remaining interval-valued socio-economic and demographic variables (13 of the 102 are listed above).
| Sample size (n) | 46 |
| Variables (p) | 102 |
| Subject area | Criminology |
| Symbolic format | Interval |
| Analytical tasks | Regression, Clustering |
Extracted from RSDA package (uscrime_int).
Rodriguez, O. (2000). Classification et modeles lineaires en analyse des donnees symboliques. Doctoral Thesis, Universite Paris IX-Dauphine.
data(uscrime.int)
Interval-valued ground snow load data from 415 weather stations in Utah and surrounding states. Each observation is a station with a 50-year ground snow load interval (lower and upper bounds of the prediction interval in kPa) plus the point estimate, geographic coordinates, and elevation.
data(utsnow.int)
A symbolic data frame (symbolic_tbl) with 415 observations
and 5 variables:
snow_load: Interval-valued 50-year ground snow load (kPa).
point_estimate: Numeric point estimate (kPa).
latitude: Numeric latitude (degrees).
longitude: Numeric longitude (degrees).
elevation: Numeric elevation (meters).
| Sample size (n) | 415 |
| Variables (p) | 5 |
| Subject area | Climate |
| Symbolic format | Interval |
| Analytical tasks | Regression, Spatial analysis |
intkrige R package (utsnow dataset).
Schmoyer, R. L. (1994). Permutation tests for correlation in regression errors. Journal of the American Statistical Association, 89(428), 1507–1516.
Bean, B., Sun, Y., and Maguire, M. (2022). Interval-valued kriging models for geostatistical mapping with uncertain inputs.
Original data from the intkrige R package (utsnow dataset).
data(utsnow.int)
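Because each station carries both an interval and a point estimate, a common first step is to summarise the interval by its center and half-width and to check that the point estimate falls inside it. The sketch below uses a hypothetical three-station table with separate `snow_min`/`snow_max` columns; these column names and all values are assumptions for illustration, not the layout or contents of utsnow.int. The final regression is the simple center method (an ordinary linear model on interval midpoints), not interval-valued kriging.

```r
# Hypothetical stations; values and column names are illustrative only
toy <- data.frame(
  snow_min       = c(1.2, 0.8, 2.5),   # lower bound (kPa)
  snow_max       = c(2.0, 1.5, 4.1),   # upper bound (kPa)
  point_estimate = c(1.6, 1.0, 3.2),   # point estimate (kPa)
  elevation      = c(1400, 1250, 2100) # meters
)

# Center and half-width (radius) of each interval
toy$center <- (toy$snow_min + toy$snow_max) / 2
toy$radius <- (toy$snow_max - toy$snow_min) / 2

# Does the point estimate lie inside its interval?
inside <- toy$point_estimate >= toy$snow_min &
  toy$point_estimate <= toy$snow_max

# Center-method regression: ordinary least squares on the midpoints
fit <- lm(center ~ elevation, data = toy)
coef(fit)
```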
Interval-valued veterinary dataset of 10 animal specimens described by height and weight ranges. Includes male and female specimens of horses, bears, foxes, cats, and dogs.
data(veterinary.int)
A symbolic data frame (symbolic_tbl) with 10 observations and 3 variables:
Animal: Animal type and sex label (e.g., HorseM, BearF; character).
Height: Height range (cm, interval).
Weight: Weight range (kg, interval).
| Sample size (n) | 10 |
| Variables (p) | 3 |
| Subject area | Zoology |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics, Clustering |
Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.
data(veterinary.int)
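For descriptive statistics, the symbolic sample mean and variance of an interval variable with observations $[a_i, b_i]$ are usually computed under the assumption that values are uniformly distributed within each interval: the mean is $\frac{1}{2n}\sum_i (a_i + b_i)$ and the variance is $\frac{1}{3n}\sum_i (a_i^2 + a_i b_i + b_i^2) - \frac{1}{4n^2}\bigl[\sum_i (a_i + b_i)\bigr]^2$. A minimal base R sketch follows; the height ranges are hypothetical and are not the values stored in veterinary.int.

```r
# Hypothetical height ranges (cm); NOT the values in veterinary.int
height_min <- c(120, 40, 20)
height_max <- c(180, 60, 30)
n <- length(height_min)

# Symbolic sample mean: average of the interval midpoints
sym_mean <- sum(height_min + height_max) / (2 * n)

# Symbolic sample variance (uniform distribution within each interval)
sym_var <- sum(height_min^2 + height_min * height_max + height_max^2) / (3 * n) -
  sum(height_min + height_max)^2 / (4 * n^2)

c(mean = sym_mean, variance = sym_var)
```

The variance equals the (population) variance of the midpoints plus the average within-interval variance, width squared over 12, which is why it is larger than the variance of the midpoints alone.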
Interval-valued engagement metrics for 10 user groups on a video platform. Variables represent ranges of visit, watch, like, comment, and share counts.
data(video1.int)
A symbolic data frame (symbolic_tbl) with 10 observations
and 5 interval-valued variables (V1–V5): number of visits, watches,
likes, comments, and shares.
| Sample size (n) | 10 |
| Variables (p) | 5 |
| Subject area | Digital media |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (video1 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (video1 dataset).
data(video1.int)
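As a rough illustration of PCA on interval data of this shape, the sketch below applies the classical centers method: each interval is replaced by its midpoint and an ordinary PCA is run on the midpoints. This is a swapped-in baseline, not the method implemented in GPCSIV or described by Makosso-Kallyth and Diday (2012). The toy table, its `_min`/`_max` column layout, and all values are assumptions; the same approach applies to video2.int and video3.int.

```r
# Hypothetical engagement ranges for 5 user groups and 3 variables
toy <- data.frame(
  V1_min = c(10, 20, 15, 40, 30), V1_max = c(30, 50, 25, 90, 60),
  V2_min = c(5, 8, 2, 20, 12),    V2_max = c(15, 20, 10, 45, 30),
  V3_min = c(0, 1, 0, 5, 2),      V3_max = c(4, 6, 2, 15, 9)
)

mins <- as.matrix(toy[, grep("_min$", names(toy))])
maxs <- as.matrix(toy[, grep("_max$", names(toy))])

# Centers method: one midpoint per interval
centers <- (mins + maxs) / 2
colnames(centers) <- sub("_min$", "", colnames(mins))

# Ordinary PCA on the standardized midpoints
pca <- prcomp(centers, center = TRUE, scale. = TRUE)
summary(pca)
```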
Interval-valued engagement metrics for 10 user groups on a video platform. Variables represent ranges of visit, watch, like, comment, and share counts.
data(video2.int)
A symbolic data frame (symbolic_tbl) with 10 observations
and 5 interval-valued variables (V1–V5): number of visits, watches,
likes, comments, and shares.
| Sample size (n) | 10 |
| Variables (p) | 5 |
| Subject area | Digital media |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (video2 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (video2 dataset).
data(video2.int)
Interval-valued engagement metrics for 10 user groups on a video platform. Variables represent ranges of visit, watch, like, comment, and share counts.
data(video3.int)
A symbolic data frame (symbolic_tbl) with 10 observations
and 5 interval-valued variables (V1–V5): number of visits, watches,
likes, comments, and shares.
| Sample size (n) | 10 |
| Variables (p) | 5 |
| Subject area | Digital media |
| Symbolic format | Interval |
| Analytical tasks | PCA |
GPCSIV R package (video3 dataset).
Makosso-Kallyth, S. and Diday, E. (2012). Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification, 6(2), 147–159.
Original data from the GPCSIV R package (video3 dataset).
data(video3.int)
Large interval-valued dataset of water flow sensor readings with 316 observations and 47 interval-valued feature variables (if1–if48, excluding if17), classified into 2 groups. Used as a benchmark for clustering interval data with high-dimensional features.
data(water_flow.int)
A data frame with 316 observations and 48 variables:
if1 through if48 (excluding if17): 47 interval-valued
sensor feature measurements.
class: Group label (1 or 2).
| Sample size (n) | 316 |
| Variables (p) | 48 |
| Subject area | Engineering |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(water_flow.int)
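The variable count follows from the naming scheme: 48 feature indices minus the missing if17 gives 47 features, plus the class label gives 48 variables. A short sketch that builds the expected names is shown below; it assumes the lower-case names listed above, and the commented subsetting line is hypothetical since the stored column layout may differ.

```r
# Expected feature names: if1 ... if48 without if17
feature_names <- setdiff(paste0("if", 1:48), "if17")
length(feature_names)

# Features plus the group label
all_columns <- c(feature_names, "class")
length(all_columns)

# Hypothetical use, assuming the columns carry exactly these names:
# X <- water_flow.int[, feature_names]
```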
Histogram-valued weight distributions for 7 age groups (20s through 80s). Each observation represents an age decade with a 7-bin histogram of weight values (pounds).
data(weight_age.hist)
A data frame with 7 observations and 1 histogram-valued variable:
weight: Histogram-valued weight distribution (pounds).
Row names indicate age groups (20s, 30s, 40s, 50s, 60s, 70s, 80s).
| Sample size (n) | 7 |
| Variables (p) | 1 |
| Subject area | Medical |
| Symbolic format | Histogram |
| Analytical tasks | Descriptive statistics |
Billard, L. and Diday, E. (2006), Table 3.10.
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 3.10.
data(weight_age.hist)
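For a single histogram-valued observation with bins $[a_k, b_k)$ and relative frequencies $p_k$, the mean is $\sum_k p_k (a_k + b_k)/2$ and, assuming values are uniform within each bin, the variance is $\sum_k p_k (a_k^2 + a_k b_k + b_k^2)/3$ minus the squared mean. The sketch below computes both for one hypothetical 7-bin weight histogram; the bin edges and frequencies are invented and are not taken from Table 3.10.

```r
# Hypothetical 7-bin weight histogram (pounds) for one age group
breaks <- seq(70, 280, by = 30)                      # 8 edges, 7 bins
probs  <- c(0.05, 0.15, 0.30, 0.25, 0.15, 0.07, 0.03)

a <- head(breaks, -1)  # lower bin edges
b <- tail(breaks, -1)  # upper bin edges

# Mean: frequency-weighted bin midpoints
hist_mean <- sum(probs * (a + b) / 2)

# Variance, assuming a uniform distribution within each bin
hist_var <- sum(probs * (a^2 + a * b + b^2) / 3) - hist_mean^2

c(mean = hist_mean, variance = hist_var)
```

Averaging these per-observation means over all age groups gives the symbolic sample mean of the histogram-valued variable.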
Interval-valued chemical and physical properties of 33 wine samples classified into 2 groups. Contains 9 interval-valued measurement variables. Used as a benchmark for interval data clustering algorithms.
data(wine.int)
A data frame with 33 observations and 10 variables:
V1 through V9: Nine interval-valued chemical/physical
property measurements.
class: Wine group (1 or 2).
| Sample size (n) | 33 |
| Variables (p) | 10 |
| Subject area | Food science |
| Symbolic format | Interval |
| Analytical tasks | Clustering |
https://github.com/Natandradesa/Kernel-Clustering-for-Interval-Data
Andrade, N. A., de Carvalho, F. A. T. and Pimentel, B. A. (2025). Kernel clustering with automatic variable weighting for interval data. Neurocomputing, 617, 128954.
data(wine.int)
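A simple baseline for clustering interval data of this kind is hierarchical clustering on a Hausdorff-type distance, where the distance between intervals $[a, b]$ and $[c, d]$ is $\max(|a - c|, |b - d|)$ and distances are summed over variables. The sketch below is only that baseline on a hypothetical four-sample, two-variable table; it is not the kernel clustering method of Andrade et al. (2025), and the matrices `toy_min`/`toy_max` are invented for illustration.

```r
# Hypothetical lower and upper bounds for 4 samples and 2 variables
ids <- paste0("s", 1:4)
toy_min <- matrix(c(1, 1.2, 5, 5.5,  10, 11, 20, 19), ncol = 2,
                  dimnames = list(ids, c("V1", "V2")))
toy_max <- matrix(c(2, 2.1, 7, 7.2,  12, 12.5, 25, 24), ncol = 2,
                  dimnames = list(ids, c("V1", "V2")))

# Hausdorff distance per variable, summed over variables
hausdorff <- function(i, j) {
  sum(pmax(abs(toy_min[i, ] - toy_min[j, ]),
           abs(toy_max[i, ] - toy_max[j, ])))
}

n <- nrow(toy_min)
D <- outer(seq_len(n), seq_len(n), Vectorize(hausdorff))
dimnames(D) <- list(ids, ids)

# Average-linkage hierarchical clustering, cut into 2 groups
groups <- cutree(hclust(as.dist(D), method = "average"), k = 2)
groups
```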
Interval-valued data for soccer teams grouped by World Cup qualification status (yes/no). Includes age, weight, height ranges and the covariance between weight and height.
data(world_cup.int)
A data frame with 2 observations and 8 variables:
world_cup: Qualification status (yes/no, character).
age_l, age_u: Player age range (years).
weight_l, weight_u: Player weight range (kg).
height_l, height_u: Player height range (meters).
cov_weight_height: Covariance between weight and height (numeric).
| Sample size (n) | 2 |
| Variables (p) | 8 |
| Subject area | Sports |
| Symbolic format | Interval |
| Analytical tasks | Descriptive statistics |
Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.9, p.13.
data(world_cup.int)
Writes a symbolic data object (interval, histogram, modal, or any
data frame) to a CSV file. Interval data stored in RSDA format
(symbolic_tbl with complex columns) is automatically converted to
MM format (paired _min/_max columns) before writing.
write_symbolic_csv(x, file, sep = ",", row.names = TRUE, na = "NA", quote = TRUE, ...)
| x | A symbolic data object (interval, histogram, or modal) or any data frame. |
| file | Path to the output CSV file. |
| sep | Field separator character. Default ",". |
| row.names | Logical or character. Default TRUE. |
| na | Character string to use for missing values. Default "NA". |
| quote | Logical; should character and factor columns be quoted? Default TRUE. |
| ... | Additional arguments passed to the underlying write function. |
write_symbolic_csv handles every tabular symbolic type stored in
dataSDA:
Interval (RSDA): symbolic_tbl objects with complex
interval columns are converted to MM format before writing.
Interval (MM): Data frames with _min/_max
columns are written directly.
Histogram / Modal / Other: Plain data frames are written directly.
The output is a standard CSV that can be read back with
read_symbolic_csv.
Invisibly returns the data frame that was written (after any conversion).
# Interval data (RSDA symbolic_tbl)
data(mushroom.int)
tmp <- tempfile(fileext = ".csv")
write_symbolic_csv(mushroom.int, tmp)
cat(readLines(tmp, n = 3), sep = "\n")

# Histogram data
data(airline_flights.hist)
tmp2 <- tempfile(fileext = ".csv")
write_symbolic_csv(airline_flights.hist, tmp2)
cat(readLines(tmp2, n = 3), sep = "\n")
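To make the RSDA-to-MM conversion concrete, the following base R sketch mimics it on a toy table without using dataSDA or RSDA. It assumes each interval cell is stored as a complex number whose real part is the lower bound and whose imaginary part is the upper bound. The helper `to_mm()` is hypothetical and is not part of the package; the actual conversion inside write_symbolic_csv may differ.

```r
# Assumption: an interval cell is a complex number, Re = lower, Im = upper
toy <- data.frame(
  Height = complex(real = c(120, 40), imaginary = c(180, 60)),
  Weight = complex(real = c(300, 5),  imaginary = c(500, 9)),
  row.names = c("HorseM", "FoxF")
)

# Hypothetical helper (not a dataSDA function): split complex columns
# into paired _min/_max columns and keep other columns unchanged
to_mm <- function(df) {
  cols <- lapply(names(df), function(nm) {
    v <- df[[nm]]
    if (is.complex(v)) {
      out <- data.frame(Re(v), Im(v))
      names(out) <- paste0(nm, c("_min", "_max"))
    } else {
      out <- data.frame(v)
      names(out) <- nm
    }
    out
  })
  out <- do.call(cbind, cols)
  rownames(out) <- rownames(df)
  out
}

mm <- to_mm(toy)

# Write the MM table as a standard CSV and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(mm, tmp, row.names = TRUE)
back <- read.csv(tmp, row.names = 1)
back
```

Writing paired numeric columns keeps the file readable by any CSV tool, which a complex-valued column would not be.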