Citation Request:
   This breast cancer database was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg. If you publish results
   when using this database, then please include this information in your
   acknowledgements. Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O. L. Mangasarian: "Multisurface method of
      pattern separation for medical diagnosis applied to breast cytology",
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition
      via linear programming: Theory and application to medical diagnosis",
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming
      discrimination of two linearly inseparable sets", Optimization Methods
      and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
   -- Dr. William H. Wolberg (physician)
      University of Wisconsin Hospitals
      Madison, Wisconsin
      USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992

3. Past Usage:

   Attributes 2 through 10 have been used to represent instances.
   Each instance has one of 2 possible classes: benign or malignant.

   1. Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of
      pattern separation for medical diagnosis applied to breast cytology. In
      Proceedings of the National Academy of Sciences, 87, 9193-9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with
         50% of the data
         -- Accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with
         67% of data
         -- Accuracy on remaining 33% of dataset: 95.9%

   2. Zhang, J. (1992). Selecting typical instances in instance-based
      learning. In Proceedings of the Ninth International Machine
      Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan
      Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms
      -- Collected classification results averaged over 10 trials
      -- Best accuracy result:
         -- 1-nearest neighbor: 93.7%
         -- trained on 200 instances, tested on the other 169
      -- Also of interest:
         -- Using only typical instances: 92.2% (storing only 23.1 instances)
         -- trained on 200 instances, tested on the other 169

4. Relevant Information:

   Samples arrive periodically as Dr. Wolberg reports his clinical cases.
   The database therefore reflects this chronological grouping of the data.
   This grouping information appears immediately below, having been removed
   from the data itself:

     Group 1: 367 instances (January 1989)
     Group 2:  70 instances (October 1989)
     Group 3:  31 instances (February 1990)
     Group 4:  17 instances (April 1990)
     Group 5:  48 instances (August 1990)
     Group 6:  49 instances (Updated January 1991)
     Group 7:  31 instances (June 1991)
     Group 8:  86 instances (November 1991)
     -----------------------------------------
     Total:   699 points (as of the donated database on 15 July 1992)
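
   As a quick sanity check, the group sizes listed above can be summed to
   confirm the stated total. This is a minimal Python sketch; the variable
   names are illustrative and not part of the original file.

```python
# Group sizes as listed in the table above (Group 1 through Group 8).
group_sizes = [367, 70, 31, 17, 48, 49, 31, 86]

# The sum should match the stated total of 699 points.
total = sum(group_sizes)
print(total)  # 699
```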

   Note that the results summarized above in Past Usage refer to a dataset
   of size 369, while Group 1 has only 367 instances. This is because it
   originally contained 369 instances; 2 were removed. The following
   statements summarize changes to the original Group 1's set of data:

   #####  Group 1 : 367 points: 200B 167M (January 1989)
   #####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   #####  Revised Nov 22, 1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                  : Changed 0 to 1 in field 6 of sample 1219406
   #####                  : Changed 0 to 1 in field 8 of following sample:
   #####                  : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
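
   The attribute table above maps onto an 11-field record. The following
   Python sketch shows one way such records might be parsed. It assumes the
   data file is comma-separated with no header row. The column identifiers
   are hypothetical names derived from the table, and the sample record is
   made up for illustration, not taken from the dataset.

```python
import csv
import io

# Hypothetical column identifiers derived from the attribute table;
# this sketch assumes the data file itself has no header row.
COLUMNS = [
    "sample_code_number",
    "clump_thickness",
    "uniformity_of_cell_size",
    "uniformity_of_cell_shape",
    "marginal_adhesion",
    "single_epithelial_cell_size",
    "bare_nuclei",
    "bland_chromatin",
    "normal_nucleoli",
    "mitoses",
    "class",
]

# Class codes as documented: 2 for benign, 4 for malignant.
CLASS_LABELS = {2: "benign", 4: "malignant"}


def parse_records(text):
    """Parse comma-separated records into dicts keyed by COLUMNS."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        records.append(dict(zip(COLUMNS, row)))
    return records


# Made-up record for illustration only; it is NOT taken from the dataset.
sample = "1234567,5,1,1,1,2,1,3,1,1,2\n"
record = parse_records(sample)[0]
label = CLASS_LABELS[int(record["class"])]
print(record["sample_code_number"], label)
```

   Field values are kept as strings here so that a non-numeric marker in an
   attribute field does not make parsing fail.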

8. Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing
   (i.e., unavailable) attribute value, now denoted by "?".
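
   Because missing values are written as "?", a direct integer conversion of
   every field would fail on those instances. A minimal Python sketch of one
   way to handle the marker follows. The helper name to_int_or_none is
   hypothetical, and the example fields are made up rather than taken from
   the dataset.

```python
def to_int_or_none(field):
    """Convert an attribute field to int, mapping the '?' marker to None."""
    field = field.strip()
    return None if field == "?" else int(field)


# Made-up attribute fields for illustration only (not from the dataset).
fields = ["5", "1", "?", "3"]
values = [to_int_or_none(f) for f in fields]

# Count how many of the fields were missing.
missing = sum(v is None for v in values)
print(values, missing)
```

   Whether to drop or impute the affected instances afterwards is left to
   the analysis; this file does not prescribe either.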

9. Class distribution:

   Benign: 458 (65.5%)
   Malignant: 241 (34.5%)
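
   The percentages above follow directly from the class counts and the total
   of 699 instances. A short Python sketch that reproduces them, with
   illustrative variable names:

```python
# Class counts as stated in the class distribution above.
counts = {"benign": 458, "malignant": 241}

total = sum(counts.values())

# Percentage of each class, rounded to one decimal place.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, percentages)
```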