From the table Appendix A. Metrics defined that I attached in research paper Ope
Question
From the table "Appendix A. Metrics defined", which I attached from the research paper
"Open data quality measurement framework: Definition and application",
extract the following, based on Figure 4.2:
2 Base measures
2 Derived measures
2 Indicators
2 Information products
2 Attributes
2 Variables
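Appendix A. Metrics defined (columns: Characteristic, Metric, Variables, Formula, Scale, Normalization, Alternative in literature)

Characteristic: Traceability
Metric: Track of creation
Variables: s: Source; dc: Date of creation
Formula: tc = 2s + dc
Scale: [0, 3]
Normalization: tcn = tc/3
Alternative in literature: -

Characteristic: Traceability
Metric: Track of updates
Variables: lu: List of updates; du: Dates of updates
Formula: tu = lu + du
Scale: [0, 2]
Normalization: tun = tu/2
Alternative in literature: -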
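The two traceability scores above can be sketched as follows. This is a minimal illustration, assuming each variable (s, dc, lu, du) is scored 1 when the corresponding metadata is present and 0 otherwise; that binary scoring is an assumption that fits the stated scales [0, 3] and [0, 2], and the function names are hypothetical.

```python
def track_of_creation(source_present: bool, creation_date_present: bool) -> float:
    """Normalized track of creation: tcn = (2s + dc) / 3."""
    s = int(source_present)            # assumption: 1 if the source is stated, else 0
    dc = int(creation_date_present)    # assumption: 1 if the creation date is stated, else 0
    tc = 2 * s + dc                    # raw score on the scale [0, 3]
    return tc / 3


def track_of_updates(update_list_present: bool, update_dates_present: bool) -> float:
    """Normalized track of updates: tun = (lu + du) / 2."""
    lu = int(update_list_present)      # assumption: 1 if a list of updates exists, else 0
    du = int(update_dates_present)     # assumption: 1 if the update dates are stated, else 0
    tu = lu + du                       # raw score on the scale [0, 2]
    return tu / 2


print(track_of_creation(True, False))
print(track_of_updates(True, True))
```

The source carries twice the weight of the creation date in tc, so a dataset that names its source but not its creation date still reaches two thirds of the maximum.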
Characteristic: Currentness
Metric: Percentage of current rows
Variables: ncr: Number of not current rows; nr: Number of rows
Formula: pcr = (1 - ncr/nr) * 100
Scale: [0%, 100%]
Normalization: pcrn = pcr/100
Alternative in literature: Several authors gave different definitions of timeliness and currency (Heinrich, Klier and Kaiser, 2009). One of the most used (adopted by the methodologies DQA, COLDQ, CDQ) is timeliness defined as Timeliness = max(0; 1 - Currency/Volatility) (Batini, Cappiello, Francalanci and Maurino, 2009). Other references: Heinrich (2002) and Ballou, Wang, Pazer and Tayi (1998).
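To make the currentness formula and the alternative timeliness definition concrete, here is a short sketch. The function names are hypothetical, and it assumes that currency and volatility are expressed in the same time unit (for example, days), which the table does not state.

```python
def percentage_current_rows(ncr: int, nr: int) -> float:
    """pcr = (1 - ncr/nr) * 100, where ncr counts the rows that are NOT current."""
    if nr <= 0:
        raise ValueError("nr must be positive")
    return (1 - ncr / nr) * 100


def timeliness(currency: float, volatility: float) -> float:
    """Timeliness = max(0, 1 - currency/volatility).

    Assumption: both arguments use the same time unit.
    """
    if volatility <= 0:
        raise ValueError("volatility must be positive")
    return max(0.0, 1 - currency / volatility)


pcr = percentage_current_rows(ncr=25, nr=200)
print(pcr, pcr / 100)                          # raw percentage and normalized pcrn
print(timeliness(currency=3, volatility=12))
```

The max(0, ...) guard keeps timeliness at zero once the data is older than its volatility period, instead of letting it go negative.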
Characteristic: Currentness
Metric: Delay in publication
Variables: da: Date of information availability; dp: Date of publication; sd: Start date of the period of time referred to by the dataset; ed: End date of the period of time referred to by the dataset
Formula: da = ed + 1; dp = 1 - (dp - da)/(ed - sd) (on the right-hand side, dp is the date of publication; the left-hand side is the metric value)
Scale: (-∞, 1)
Normalization: dpn = dp

Characteristic: Expiration
Metric: Delay after expiration
Variables: ed: Expiration date; cd: Current date; sd: Start date of the period of time referred to by the dataset; ed: End date of the period of time referred to by the dataset
Formula: dae = 1 - (cd - ed)/(ed - sd)
Scale: (-∞, +∞)
Normalization:
if (dae <= 0) daen = 0
else if (dae <= 1) daen = dae
else if (dae > 1) daen = 1
Alternative in literature: -

Characteristic: Completeness
Metric: Percentage of complete cells
Variables: nr: Number of rows; nc: Number of columns; ic: Number of incomplete cells; ncl: Number of cells
Formula: ncl = nr * nc; pcc = (1 - ic/ncl) * 100
Scale: [0%, 100%]
Normalization: pccn = pcc/100
Alternative in literature: Completeness with the "open world" assumption (i.e., the assumption that not all real-world entities are represented in the schema) (Batini & Scannapieco, 2006).
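The two date-based metrics and the cell-completeness metric above can be sketched as below. Several points are assumptions rather than statements from the table: dates are compared in whole days, "ed + 1" is read as one day after the end date, the expiration date and the end date of the reference period are treated as the same date (the table uses ed for both), the normalization of dae is read as a clamp to [0, 1], and pcc divides by the number of cells ncl. All function names are hypothetical.

```python
from datetime import date, timedelta


def _period_days(sd: date, ed: date) -> int:
    """Length in days of the period the dataset refers to."""
    days = (ed - sd).days
    if days <= 0:
        raise ValueError("ed must be later than sd")
    return days


def delay_in_publication(dp: date, sd: date, ed: date) -> float:
    """1 - (dp - da)/(ed - sd), with da = ed + 1 day (assumed unit: days)."""
    da = ed + timedelta(days=1)                  # date of information availability
    return 1 - (dp - da).days / _period_days(sd, ed)


def delay_after_expiration_normalized(cd: date, sd: date, ed: date) -> float:
    """dae = 1 - (cd - ed)/(ed - sd), then clamped to [0, 1] (assumed reading)."""
    dae = 1 - (cd - ed).days / _period_days(sd, ed)
    return min(1.0, max(0.0, dae))


def percentage_complete_cells(ic: int, nr: int, nc: int) -> float:
    """pcc = (1 - ic/ncl) * 100 with ncl = nr * nc."""
    ncl = nr * nc
    if ncl <= 0:
        raise ValueError("the dataset must contain at least one cell")
    return (1 - ic / ncl) * 100


sd, ed = date(2020, 1, 1), date(2020, 12, 31)
print(delay_in_publication(date(2021, 3, 15), sd, ed))
print(delay_after_expiration_normalized(date(2021, 7, 2), sd, ed))
print(percentage_complete_cells(ic=30, nr=100, nc=6))
```

With these readings, a dataset published on its availability date scores 1 for delay in publication, and the score falls by one full point for every reference-period length of delay.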
Characteristic: Completeness
Metric: Percentage of complete rows
Variables: nr: Number of rows; nir: Number of incomplete rows
Formula: pcpr = (1 - nir/nr) * 100
Scale: [0%, 100%]
Normalization: pcpn = pcpr/100

Characteristic: Compliance
Metric: Percentage of standard columns
Variables: ns: Number of columns with associated standards; nsc: Number of standardized columns
Formula: psc = (ns/nsc) * 100
Scale: [0%, 100%]

Characteristic: Compliance
Metric: eGMS compliance
Variables: s: Source; dc: Date of creation; c: Category; t: Title; d: Description (if applicable); id: Identifier (if applicable); pb: Publisher (if applicable); cv: Coverage (recommended only); l: Language (recommended only)
Formula: egmsc = s + dc + c + t + 0.2(d + id + pb + cv + l)
Scale: [0, 5]
Normalization: egmscn = egmsc/5
Alternative in literature: Interpretability (metric used in the Data Warehouse Quality (DWQ) methodology), defined as "Number of tuples with interpretable data, documentation for key values" (Batini et al., 2009; Jeusfeld et al., 1998).

Characteristic: Compliance
Metric: Five star Open Data
Formula: This metric does not require any formula; the value assigned depends on the level of the scheme in which the dataset is.
Scale: [0, 5]
Normalization: fsodn = fsod/5
Alternative in literature: -

Characteristic: Understandability
Metric: Percentage of columns with metadata
Variables: ncm: Number of columns with metadata; nc: Number of columns

Characteristic: Understandability
Metric: Percentage of columns in comprehensible format
Variables: ncuf: Number of columns in understandable format; nc: Number of columns
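The eGMS compliance score and the Five star Open Data normalization above lend themselves to a small worked example. This sketch assumes each metadata element counts as 1 when present and 0 when absent (binary scoring is an assumption that fits the scale of 0 to 5); the function and constant names are hypothetical.

```python
FULL_WEIGHT = ("s", "dc", "c", "t")            # source, creation date, category, title
PARTIAL_WEIGHT = ("d", "id", "pb", "cv", "l")  # "if applicable" / "recommended only" elements


def egms_compliance(present: set) -> float:
    """Normalized eGMS compliance: egmscn = egmsc / 5.

    `present` holds the short codes of the metadata elements the dataset provides.
    Assumption: each element scores 1 if present, 0 otherwise.
    """
    full = sum(1 for code in FULL_WEIGHT if code in present)
    partial = sum(1 for code in PARTIAL_WEIGHT if code in present)
    egmsc = full + 0.2 * partial               # raw score between 0 and 5
    return egmsc / 5


def five_star_normalized(fsod: int) -> float:
    """fsodn = fsod / 5, where fsod is the level reached in the five-star scheme."""
    if not 0 <= fsod <= 5:
        raise ValueError("fsod must be between 0 and 5")
    return fsod / 5


print(egms_compliance({"s", "dc", "c", "t", "d", "l"}))
print(five_star_normalized(3))
```

The five lower-weight elements together contribute as much as one full-weight element, so the raw score tops out at 4 + 0.2 * 5 = 5.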
Characteristic: Accuracy
Metric: Percentage of syntactically accurate cells
Variables: nce: Number of cells with errors; nci: Number of cells
Formula: pac = (1 - nce/nci) * 100
Scale: [0%, 100%]
Normalization: pacn = pac/100
Alternative in literature: Semantic accuracy, which considers not only the values that do not belong to a certain domain but also all the values that do not represent the real-world entity correctly, e.g., incoherent values and typos in names (Batini & Scannapieco, 2006; Heinrich, 2002; Kaiser et al., 2007).

Characteristic: Accuracy
Metric: Accuracy in aggregation
Variables: e: Errors sum; s: Scale; oav: Own aggregation value; dav: Dataset aggregation value
Formula: e = \sum_{i=1}^{n} |dav_i - oav_i|; ea = 1 - (e/s)
Scale: [-∞, 1]
Normalization:
if (ea <= 0) ean = 0
else if (ea <= 0.9) ean = 0.25*ea
else if (ea <= 0.95) ean = 0.5*ea
else if (ea <= 0.999) ean = 0.75*ea
else if (ea > 0.999) ean = ea
Alternative in literature: The metric "derivation integrity" in the TIQM framework calculates the same thing but in a broader way; it is defined as "percentage of correct calculations of derived data according to the integrity derivation formula or calculation definition" (Batini et al., 2009; English, 1999).
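The accuracy-in-aggregation metric and its stepwise normalization can be sketched as follows. The table only names s as "Scale", so this sketch treats it as a value supplied by the caller; the function names are hypothetical, and the fourth branch is read as testing ea (not ean).

```python
def aggregation_accuracy(dav: list, oav: list, s: float) -> float:
    """ea = 1 - e/s, where e is the sum of |dav_i - oav_i| over all aggregates.

    dav: aggregation values published in the dataset
    oav: aggregation values recomputed by the evaluator
    s:   scale (assumption: supplied by the caller; the table does not define it further)
    """
    if len(dav) != len(oav):
        raise ValueError("dav and oav must have the same length")
    if s <= 0:
        raise ValueError("s must be positive")
    e = sum(abs(d - o) for d, o in zip(dav, oav))
    return 1 - e / s


def normalize_aggregation_accuracy(ea: float) -> float:
    """Stepwise normalization of ea into ean, following the thresholds in the table."""
    if ea <= 0:
        return 0.0
    if ea <= 0.9:
        return 0.25 * ea
    if ea <= 0.95:
        return 0.5 * ea
    if ea <= 0.999:
        return 0.75 * ea
    return ea


ea = aggregation_accuracy(dav=[100, 250, 75], oav=[100, 248, 75], s=425)
print(ea, normalize_aggregation_accuracy(ea))
```

The steps penalize even small aggregation errors heavily: only a value of ea above 0.999 keeps its full score.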
Explanation / Answer
2 Base Measures -
da = ed + 1
dp = 1 - (dp - da)/(ed - sd)
pcpr = (1 - nir/nr) * 100
psc = (ns/nsc) * 100
egmsc = s + dc + c + t + 0.2(d + id + pb + cv + l)
2 Derived Measures -
tcn = tc/3
tun = tu/2
e = \sum_{i=1}^{n} |dav_i - oav_i|
ea = 1 - (e/s)
2 Indicators -
if (dae <= 0) daen = 0
else if (dae <= 1) daen = dae
else if (dae > 1) daen = 1
if (ea <= 0) ean = 0
else if (ea <= 0.9) ean = 0.25*ea
2 Information Products -
2 Attributes -
2 Variables -
nce: Number of cells with errors
nci: Number of cells
nr: Number of rows
nc: Number of columns
ic: Number of incomplete cells
ncl: Number of cells
ncl = nr * nc
pcc = (1 - ic/ncl) * 100
Percentage of syntactically accurate cells
nce: Number of cells with errors
nci: Number of cells
pac = (1 - nce/nci) * 100