Title: | Data Sets for Symbolic Data Analysis |
---|---|
Description: | Collects a diverse range of symbolic data and offers a comprehensive set of functions that facilitate the conversion of traditional data into the symbolic data format. |
Authors: | Po-Wei Chen [aut], Han-Ming Wu [cre] |
Maintainer: | Han-Ming Wu <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2025-02-23 05:03:55 UTC |
Source: | https://github.com/cran/dataSDA |
An interval-valued data set containing 24 units, created from the Abalone dataset (UCI Machine Learning Repository) by aggregating by sex and age.
data(Abalone.iGAP)
An object of class data.frame with 24 rows and 7 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(Abalone.iGAP)
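The aggregation step mentioned in the description can be sketched in base R. This is only an illustration under stated assumptions: the microdata below is made up, and using the group minimum and maximum as interval bounds is an assumption, since the description does not say how the bounds were chosen.

```r
# Hypothetical microdata standing in for the UCI Abalone records.
micro <- data.frame(
  sex    = c("F", "F", "M", "M", "M"),
  age    = c("young", "young", "old", "old", "old"),
  length = c(0.35, 0.42, 0.51, 0.60, 0.55)
)

# One row per sex-by-age group: lower bound (min) and upper bound (max).
lower <- aggregate(length ~ sex + age, data = micro, FUN = min)
upper <- aggregate(length ~ sex + age, data = micro, FUN = max)

# Join the two summaries; the shared column "length" gets the suffixes.
intervals <- merge(lower, upper, by = c("sex", "age"),
                   suffixes = c("_min", "_max"))
print(intervals)
```

Each row of `intervals` then describes one unit by an interval `[length_min, length_max]` rather than by a single value.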
Age-cholesterol-weight Interval-Valued Dataset.
data(age_cholesterol_weight.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 7 rows and 4 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(age_cholesterol_weight.int)
Airline Flights Dataset.
data(airline_flights)
An object of class data.frame with 16 rows and 17 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(airline_flights)
Airline Flights Modal-Valued Dataset.
data(airline_flights2)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 16 rows and 6 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(airline_flights2)
Baseball Interval-Valued Dataset.
data(baseball.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 19 rows and 3 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(baseball.int)
Bird Interval-Valued Dataset.
data(bird.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 20 rows and 2 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(bird.int)
Blood Pressure Interval-Valued Dataset.
data(blood_pressure.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 3 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(blood_pressure.int)
Car Interval-Valued Dataset.
data(car.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 8 rows and 5 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(car.int)
Cars interval dataset generated from the Cars dataset. This data set consists of the intervals for four characteristics (Price, EngineCapacity, TopSpeed and Acceleration) of 27 car models partitioned into four different classes (Utilitarian, Berlina, Sportive and Luxury).
data(Cars.int)
A data frame containing 27 observations on 5 variables, the first four with the interval characteristics for 27 car models, the last one a factor indicating the model class.
https://CRAN.R-project.org/package=MAINT.Data
data(Cars.int)
China temperatures interval dataset generated from the ChinaTemp dataset. This data set consists of the intervals of observed temperatures (Celsius scale) in each of the four quarters, Q_1 to Q_4, of the years 1974 to 1988 at 60 Chinese meteorological stations; one outlier observation (YinChuan_1982) has been discarded. The 60 stations belong to different regions in China, which therefore define a partition of the 899 station-year combinations.
data(ChinaTemp.int)
A data frame containing 899 observations on 5 variables, the first four with the temperatures by quarter in the 899 station-year combinations, the last one a factor indicating the geographic region of each station.
https://CRAN.R-project.org/package=MAINT.Data
data(ChinaTemp.int)
This function is used to clean up variable names to conform to the RSDA format.
clean_colnames(data)
data | The conventional data. |
Data after cleaning variable names.
data(mushroom)
mushroom.clean <- clean_colnames(data = mushroom)
Crime demographics Dataset.
data(crime)
An object of class data.frame with 15 rows and 7 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(crime)
Crime demographics Modal-Valued Dataset.
data(crime2)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 3 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(crime2)
Finance Interval-Valued Dataset.
data(finance.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 14 rows and 7 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(finance.int)
Fuel Consumption Dataset.
data(fuel_consumption)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 10 rows and 3 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(fuel_consumption)
Health Insurance Dataset.
data(health_insurance)
An object of class data.frame with 51 rows and 30 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(health_insurance)
Health Insurance Modal-Valued Dataset.
data(health_insurance2)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 6 rows and 6 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(health_insurance2)
Hierarchy Dataset.
data(hierarchy)
An object of class data.frame with 20 rows and 6 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(hierarchy)
Hierarchy Interval-Valued Dataset.
data(hierarchy.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 20 rows and 6 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(hierarchy.int)
Horses Interval-Valued Dataset.
data(horses.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 8 rows and 7 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(horses.int)
Converts an iGAP file to a CSV file.
iGAP_to_MM(data, location)
data | The iGAP file. |
location | The location of the symbolic variable in the data. |
A CSV data file.
data(Abalone.iGAP)
Abalone <- iGAP_to_MM(Abalone.iGAP, c(1, 2, 3, 4, 5, 6, 7))
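The kind of conversion this function performs can be sketched in base R. The sketch assumes that an iGAP table stores each interval as a single "lower,upper" string and that the target layout has separate `_min` and `_max` columns; both points are assumptions for illustration. `split_interval()` is a hypothetical helper and the data is made up; this is not the package's implementation.

```r
# Hypothetical iGAP-style table: each interval is one "lower,upper" string.
igap <- data.frame(
  Length = c("0.28,0.66", "0.39,0.58"),
  Height = c("0.08,0.18", "0.11,0.20"),
  stringsAsFactors = FALSE
)

# split_interval() is an illustrative helper, not a dataSDA function.
# It turns one column of "lower,upper" strings into two numeric columns.
split_interval <- function(x, name) {
  parts <- do.call(rbind, strsplit(x, ",", fixed = TRUE))
  out <- data.frame(as.numeric(parts[, 1]), as.numeric(parts[, 2]))
  names(out) <- paste0(name, c("_min", "_max"))
  out
}

# Apply the helper to every column and bind the results side by side.
mm <- do.call(cbind, lapply(names(igap), function(v) split_interval(igap[[v]], v)))
print(mm)
```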
Lack of information questionnaire interval dataset generated from lackinfo dataset. A dataset containing some biographical data and the responses to 5 items measuring the perception of lack of information in a questionnaire.
data(lackinfo.int)
A data frame with 50 observations of the following 8 variables:
id: identification number.
sex: sex of the respondent (male or female).
age: respondent's age (in years).
item1: respondent's interval-valued answer to item 1.
item2: respondent's interval-valued answer to item 2.
item3: respondent's interval-valued answer to item 3.
item4: respondent's interval-valued answer to item 4.
item5: respondent's interval-valued answer to item 5.
An educational innovation project was carried out to improve teaching-learning processes at the University of Oviedo (Spain) in the 2020/2021 academic year. A total of 50 students were asked to answer an online questionnaire about some biographical data (sex and age) and their perception of lack of information. For each statement, they selected the interval that best represents their level of agreement on an interval-valued scale bounded between 1 and 7, where 1 represents the option 'strongly disagree' and 7 represents the option 'strongly agree'.
These are the 5 items used to measure the perception of lack of information:
I1: I receive too little information from my classmates.
I2: It is difficult to receive relevant information from my classmates.
I3: It is difficult to receive relevant information from the teacher.
I4: The amount of information I receive from my classmates is very low.
I5: The amount of information I receive from the teacher is very low.
https://CRAN.R-project.org/package=IntervalQuestionStat
data(lackinfo.int)
Loans by purpose interval dataset generated from the LoansbyPurpose dataset. This data set consists of the lower and upper bounds of the intervals for four interval characteristics of the loans aggregated by their purpose. The original microdata is available at the Kaggle Data Science platform and consists of 887 383 loan records characterized by 75 descriptors. Among the large set of variables available, we focus on borrowers' income and account and loan information aggregated by the 14 loan purposes, which are considered as the units of interest.
data(LoansbyPurpose.int)
A data frame containing 14 observations on the following 4 variables:
ln-inc: For the current loan purpose, the interval of the natural logarithm of the self-reported annual income provided by the borrower during registration.
ln-revolbal: For the current loan purpose, the interval of the natural logarithm of the total credit revolving balance.
open-acc: For the current loan purpose, the interval of the number of open credit lines in the borrower's credit file.
total-acc: For the current loan purpose, the interval of the total number of credit lines currently in the borrower's credit file.
https://CRAN.R-project.org/package=MAINT.Data
data(LoansbyPurpose.int)
The mushroom data set consists of a set of 23 species described by 3 interval variables. These mushroom species are members of the genus Agaricus. The specific variables and their values are extracted from the Fungi of California Species.
data(mushroom)
A data frame with 23 observations and 5 variables named Species, Pileus Cap Width, Stipe Length, Stipe Thickness, and Edibility.
Species: The class of mushroom.
Pileus Cap Width: The pileus cap width of the mushroom.
Stipe Length: The stipe length of the mushroom.
Stipe Thickness: The stipe thickness of the mushroom.
Edibility: The edibility of the mushroom (U: unknown, Y: Yes, N: No, T: Toxic).
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(mushroom)
Mushroom interval dataset generated from the mushroom dataset. The mushroom data set consists of a set of 23 species described by 3 interval variables. These mushroom species are members of the genus Agaricus. The specific variables and their values are extracted from the Fungi of California Species.
data(mushroom.int)
A data frame with 23 observations and 5 variables named Species, Pileus Cap Width, Stipe Length, Stipe Thickness, and Edibility.
Species: The class of mushroom.
Pileus Cap Width: The pileus cap width of the mushroom.
Stipe Length: The stipe length of the mushroom.
Stipe Thickness: The stipe thickness of the mushroom.
Edibility: The edibility of the mushroom (U: unknown, Y: Yes, N: No, T: Toxic).
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(mushroom.int)
New York City flights interval dataset generated from the nycflights dataset. An interval-valued data set containing 142 units and four interval-valued variables (dep_delay, arr_delay, air_time and distance), created from the flights data set in the R package nycflights13 (on-time data for all flights that departed the JFK, LGA or EWR airports in 2013), after removing all rows with missing observations and aggregating by month and carrier.
data(nycflights.int)
A data frame containing the original 327346 valid (i.e. with non-missing values) flights from the nycflights13 package, described by the 4 variables: dep_delay, arr_delay, air_time and distance.
A factor with 327346 observations and 142 levels, indicating the month by carrier combination to which each original flight belongs.
An IData object with 142 observations and 4 interval-valued variables, describing the intervals formed by aggregating the FlightsDF microdata by the 0.05 and 0.95 quantiles of the subsamples formed by the FlightsUnits factor.
https://CRAN.R-project.org/package=MAINT.Data
Duarte Silva, A. P., Brito, P., Filzmoser, P., & Dias, J. G. (2021). MAINT.Data: Modelling and Analysing Interval Data in R. R Journal, 13(2).
data(nycflights.int)
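The quantile-based aggregation described above (intervals bounded by the 0.05 and 0.95 quantiles within each month by carrier group) can be sketched in base R. The microdata here is simulated and only stands in for the nycflights13 flights table; the column names `dep_delay_min` and `dep_delay_max` are chosen for illustration.

```r
# Simulated microdata standing in for the nycflights13 flights table.
set.seed(1)
flights <- data.frame(
  month     = rep(1:2, each = 50),
  carrier   = rep(c("AA", "UA"), times = 50),
  dep_delay = rnorm(100, mean = 10, sd = 20)
)

# One unit per month-by-carrier combination.
unit <- interaction(flights$month, flights$carrier, drop = TRUE)

# Lower and upper bounds are the 5% and 95% quantiles within each unit.
bounds <- t(sapply(split(flights$dep_delay, unit),
                   quantile, probs = c(0.05, 0.95), names = FALSE))
colnames(bounds) <- c("dep_delay_min", "dep_delay_max")
print(bounds)
```

Using inner quantiles instead of the minimum and maximum keeps a few extreme delays from stretching the intervals.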
Occupation Salaries Dataset.
data(occupations)
An object of class data.frame with 9 rows and 11 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(occupations)
Occupation Salaries Modal-Valued Dataset.
data(occupations2)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 9 rows and 4 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(occupations2)
30-year trimmed mean daily temperature interval dataset for the Ohio River basin, generated from the ohtemp dataset. Intervals are defined by the mean daily minimum and maximum temperatures for the Ohio River basin from January 1, 1988 to December 31, 2018. Each of the 161 observations in this dataset had at least 300 daily temperature observations in at least 30 of the 31 years considered. The mean was calculated after 10% trimming to limit the influence of potential outliers.
data(ohtemp.int)
A data frame with 161 rows and 7 variables:
ID: The global historical climatological network (GHCN) station identifier
NAME: The GHCN station name
STATE: The two-digit designation for the state in which each station resides
LATITUDE: Latitude coordinate position
LONGITUDE: Longitude coordinate position
ELEVATION: Elevation of the measurement location (meters)
TEMPERATURE: The 30 year mean daily temperature (tenths of degrees Celsius)
https://CRAN.R-project.org/package=intkrige
data(ohtemp.int)
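The effect of the trimmed mean used for this dataset can be illustrated with base R's `mean()` and its `trim` argument. The readings below are made up, and trimming 10% from each tail is an assumption about how the trimming was applied; the sketch only shows why trimming limits the pull of an outlier.

```r
# Hypothetical daily maximum temperatures (tenths of degrees Celsius),
# including one implausible reading.
tmax <- c(251, 248, 260, 255, 249, 253, 257, 250, 252, 999)

plain   <- mean(tmax)              # pulled upward by the outlier
trimmed <- mean(tmax, trim = 0.1)  # drops the lowest and highest 10% of values

print(c(plain = plain, trimmed = trimmed))
```

With ten values, `trim = 0.1` discards one value at each end, so the implausible reading no longer affects the result.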
Profession Work Salary Time Interval-Valued Dataset.
data(profession.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 4 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(profession.int)
This function changes the format of the data to conform to RSDA format.
RSDA_format(data, sym_type1 = NULL, location = NULL, sym_type2 = NULL, var = NULL)
data | A conventional data set. |
sym_type1 | The symbolic type labels: I denotes an interval variable and S denotes a set variable (written $I and $S in the RSDA format). |
location | The location of the sym_type in the data. |
sym_type2 | The symbolic type labels: I denotes an interval variable and S denotes a set variable (written $I and $S in the RSDA format). |
var | The name of the symbolic variable in the data. |
Returns a data frame with a type label added in the column preceding each symbolic variable.
data("mushroom")
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
mushroom.tmp <- RSDA_format(data = mushroom.set,
                            sym_type1 = c("I", "S"),
                            location = c(25, 31),
                            sym_type2 = c("S", "I", "I"),
                            var = c("Species", "Stipe.Length_min", "Stipe.Thickness_min"))
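The idea of placing a type label in the column before a symbolic variable can be sketched in base R. `add_label()` is a hypothetical helper written for this illustration, the table is made up, and the sketch does not reproduce how RSDA_format itself is implemented.

```r
# Hypothetical min-max table with one interval variable.
mm <- data.frame(
  Species          = c("arorae", "arvenis"),
  Stipe.Length_min = c(4, 4),
  Stipe.Length_max = c(9, 14)
)

# add_label() is an illustrative helper, not a dataSDA function.
# It inserts a column holding `label` immediately before column `var`.
add_label <- function(data, var, label) {
  pos <- match(var, names(data))
  tag <- data.frame(rep(label, nrow(data)))
  names(tag) <- label
  cbind(data[seq_len(pos - 1)], tag, data[pos:ncol(data)])
}

labelled <- add_label(mm, "Stipe.Length_min", "$I")
print(names(labelled))
```

The label column marks the two columns that follow it as the lower and upper bound of one interval variable.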
This function changes the format of the set variables in the data to conform to the RSDA format.
set_variable_format(data, location, var)
data | A conventional data set. |
location | The location of the set variable in the data. |
var | The name of the set variable in the data. |
Returns a data frame in which the set variable is converted to a one-hot encoding.
data("mushroom")
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
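One-hot encoding of a set variable, as named in the return value, can be sketched in base R. The data and the `Edibility_` column prefix are made up for illustration; the column names produced by set_variable_format may differ.

```r
# Hypothetical data with one set-valued (categorical) column.
df <- data.frame(
  Species   = c("arorae", "arvenis", "arorae"),
  Edibility = c("U", "Y", "U")
)

# One 0/1 indicator column per category of Edibility.
levels_seen <- sort(unique(df$Edibility))
onehot <- sapply(levels_seen, function(l) as.integer(df$Edibility == l))
colnames(onehot) <- paste0("Edibility_", levels_seen)

# Replace the original column with its indicator columns.
encoded <- cbind(df["Species"], onehot)
print(encoded)
```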
Soccer bivar interval dataset generated from soccer.bivar dataset. A real interval-valued data set.
soccer.bivar.int
A data frame with 20 rows and 3 variables:
y: The response variable Y (weight)
t1: The explanatory variable T1 (height)
t2: The explanatory variable T2 (age)
This data set records the Weight (Y), Height (T1) and Age (T2) of 20 soccer teams of the French premier championship.
https://CRAN.R-project.org/package=iRegression
Lima Neto, E. A., Cordeiro, G. and De Carvalho, F.A.T. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation (Print), 81, 1727–1744.
data(soccer.bivar.int)
Veterinary Interval-Valued Dataset.
data(veterinary.int)
An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 10 rows and 3 columns.
Billard L. and Diday E. (2006). Symbolic data analysis: Conceptual statistics and data mining. Wiley, Chichester.
data(veterinary.int)
This function writes (saves) a symbolic data table to a CSV data file.
write_csv_table(data, file, output)
data | The conventional data. |
file | The name of the CSV file. |
output | This is an experimental argument, with default TRUE, and can be ignored by most users. |
Writes the symbolic data table to a CSV file.
data(mushroom)
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
mushroom.tmp <- RSDA_format(data = mushroom.set,
                            sym_type1 = c("I", "S"),
                            location = c(25, 31),
                            sym_type2 = c("S", "I", "I"),
                            var = c("Species", "Stipe.Length_min", "Stipe.Thickness_min"))
mushroom.clean <- clean_colnames(data = mushroom.tmp)
# We can save the file in CSV to RSDA format as follows:
write_csv_table(data = mushroom.clean, file = "mushroom_interval.csv", output = FALSE)