Electronic Medical Records with Pandas

11/8/2022

37
38
37
38
38
40
37
37
38
37
37

What's wrong? Why (look at the data)? How do we fix it?

Possible Fix

Use the Python csv module to read the file.

37
37
37
37
37
37
37
37
37
37
37
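One plausible reading of the problem above is that some fields are quoted and contain commas, so splitting each line on "," shifts the columns. A minimal sketch of that failure and the csv module fix, on a made-up two-row CSV (the column names and values are hypothetical, not taken from the real file):

```python
import csv
import io

# Hypothetical data: the first field of the first data row has a comma inside quotes.
raw = 'Region,Facility ID\n"New York, NY",37\nBuffalo,38\n'

# Naive splitting cuts the quoted field in two, so the columns shift.
naive_rows = [line.split(',') for line in raw.strip().split('\n')]

# csv.reader understands quoting and keeps the field in one piece.
csv_rows = list(csv.reader(io.StringIO(raw)))

print(naive_rows[1])  # three pieces instead of two
print(csv_rows[1])    # two fields, as intended
```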

or...

pandas

http://pandas.pydata.org/index.html

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Basically, use it if you have a big spreadsheet of data with mixed types.

C:\Users\Akhlore\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3071: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

Note: This is a large dataset, so I am preemptively selecting a subset of columns. Try to avoid loading this data more than once, as it may take up all your memory and you'll have to restart the Python kernel.

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
0 37.0 F 8 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN $9546.85
1 37.0 F 3 146 Diverticulosis and diverticulitis 0 NO PROC 90335341.0 NaN $11462.75
2 37.0 M 1 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $1609.40
3 37.0 F 1 154 Noninfectious gastroenteritis 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $2638.75
4 37.0 F 3 124 Acute and chronic tonsillitis 0 NO PROC 90335341.0 NaN $3538.25
... ... ... ... ... ... ... ... ... ... ...
2367278 943.0 F 1 245 Syncope 0 NO PROC 156102.0 NaN $10074.00
2367279 943.0 M 1 149 Biliary tract disease 0 NO PROC 267443.0 NaN $21252.00
2367280 943.0 F 1 102 Nonspecific chest pain 0 NO PROC 267443.0 NaN $11673.00
2367281 943.0 F 1 660 Alcohol-related disorders 171 SUTURE SKIN/SUBCUT TISS 267443.0 249597.0 $16722.00
2367282 943.0 M 1 2 Septicemia (except in labor) 0 NO PROC 251080.0 NaN $11637.00

2367283 rows × 10 columns

array([[37.0, 'F', '8', ..., 90335341.0, nan, '$9546.85'],
       [37.0, 'F', '3', ..., 90335341.0, nan, '$11462.75'],
       [37.0, 'M', '1', ..., 90335341.0, 167816.0, '$1609.40'],
       ...,
       [943.0, 'F', '1', ..., 267443.0, nan, '$11673.00'],
       [943.0, 'F', '1', ..., 267443.0, 249597.0, '$16722.00'],
       [943.0, 'M', '1', ..., 251080.0, nan, '$11637.00']], dtype=object)
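A minimal sketch of loading only a subset of columns, assuming pd.read_csv with its usecols argument is how the subset is selected. The in-memory CSV, its "Zip Code" column, and its values are hypothetical stand-ins for the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real (much larger) CSV file.
raw = io.StringIO(
    'Facility ID,Gender,Length of Stay,Zip Code,Total Charges\n'
    '37,F,8,147,$9546.85\n'
    '37,M,1,147,$1609.40\n'
)

# usecols keeps only the listed columns, which lowers memory use.
data = pd.read_csv(
    raw,
    usecols=['Facility ID', 'Gender', 'Length of Stay', 'Total Charges'],
)

print(data.shape)
print(data.to_numpy())  # the underlying values as a numpy array
```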

Data Access

Columns (and rows) have names that you can use to access them.

Index(['Facility ID', 'Gender', 'Length of Stay', 'CCS Diagnosis Code',
       'CCS Diagnosis Description', 'CCS Procedure Code',
       'CCS Procedure Description', 'Attending Provider License Number',
       'Operating Provider License Number', 'Total Charges'],
      dtype='object')
0          F
1          F
2          M
3          F
4          F
          ..
2367278    F
2367279    M
2367280    F
2367281    F
2367282    M
Name: Gender, Length: 2367283, dtype: object
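A small sketch of access by name, using a three-row toy frame built from the first rows shown earlier (the variable name data is an assumption):

```python
import pandas as pd

data = pd.DataFrame({
    'Gender': ['F', 'F', 'M'],
    'CCS Diagnosis Code': [197, 146, 50],
})

print(data.columns)    # the column labels
print(data['Gender'])  # one column, returned as a Series
print(data.Gender)     # attribute access works when the name is a valid identifier
```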

Data Access

0    197
1    146
2     50
Name: CCS Diagnosis Code, dtype: int64

[] slices by rows but indexes by column name. You must provide a range (a slice) to get rows; a single value such as data[0] is interpreted as a column label and raises a KeyError if no such column exists.

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
0 37.0 F 8 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN $9546.85
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-59-3512e1a6c539> in <module>
----> 1 data[0]

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
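The two behaviours of [] can be reproduced on a toy frame (a sketch; the slice data[:1] is an assumed example of a row range):

```python
import pandas as pd

data = pd.DataFrame({
    'Gender': ['F', 'F', 'M'],
    'CCS Diagnosis Code': [197, 146, 50],
})

first_row = data[:1]  # a slice selects rows by position

raised = False
try:
    data[0]           # a bare integer is looked up as a column label
except KeyError:
    raised = True

print(first_row)
print(raised)
```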

iloc: Position indexing

If you want to reference a pandas DataFrame with position-based indexing, use .iloc. It works just like numpy indexing.

'F'
Facility ID                                                               37
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                9.03353e+07
Operating Provider License Number                                        NaN
Total Charges                                                       $9546.85
Name: 0, dtype: object
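A short sketch of .iloc on a toy frame (the specific positions chosen here are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'Facility ID': [37.0, 37.0, 37.0],
    'Gender': ['F', 'F', 'M'],
    'Length of Stay': [8, 3, 1],
})

cell = data.iloc[0, 1]     # row 0, column 1
row = data.iloc[0]         # the whole first row, as a Series
block = data.iloc[:2, 1:]  # first two rows, every column after the first

print(cell)
print(row)
print(block)
```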

Pandas uses NaN to indicate missing data

Facility ID                                                               37
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                9.03353e+07
Operating Provider License Number                                        NaN
Total Charges                                                        9546.85
Charge per day                                                       1193.36
Name: 0, dtype: object
Facility ID                                                               37
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                9.03353e+07
Operating Provider License Number                                        NaN
Total Charges                                                       $9546.85
Name: 0, dtype: object

.loc: Label indexing

You can also index by the label names. Note the rows are being indexed by their named index.

0    F
1    F
2    M
3    F
Name: Gender, dtype: object
CCS Diagnosis Description CCS Diagnosis Code
0 Skin and subcutaneous tissue infections 197
3 Noninfectious gastroenteritis 154
5 Influenza 123
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
10 37.0 F 3 123 Influenza 0 NO PROC 90335341.0 NaN $4566.15
11 37.0 M 7 122 Pneumonia (except that caused by tuberculosis ... 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $9822.90
12 37.0 M 3 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $5063.05
13 37.0 M 2 155 Other gastrointestinal disorders 0 NO PROC 90335341.0 NaN $3125.75
14 37.0 F 3 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $5055.45
15 37.0 F 6 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $9734.05
16 37.0 F 4 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $7168.05
17 37.0 F 2 197 Skin and subcutaneous tissue infections 0 NO PROC 90301264.0 NaN $2812.85
18 37.0 M 3 58 Other nutritional; endocrine; and metabolic di... 0 NO PROC 90335341.0 NaN $3377.50
19 37.0 F 2 125 Acute bronchitis 0 NO PROC 90335341.0 NaN $3214.25
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-67-4009993f2f6d> in <module>
----> 1 data[10:20].loc[[0,3,5]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766 
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769 
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1952                     raise ValueError("Cannot index with multidimensional key")
   1953 
-> 1954                 return self._getitem_iterable(key, axis=axis)
   1955 
   1956             # nested tuple slicing

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1593         else:
   1594             # A collection of keys
-> 1595             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1596             return self.obj._reindex_with_indexers(
   1597                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1550             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1551 
-> 1552         self._validate_read_indexer(
   1553             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1554         )

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1638             if missing == len(indexer):
   1639                 axis_name = self.obj._get_axis_name(axis)
-> 1640                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1641 
   1642             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Int64Index([0, 3, 5], dtype='int64')] are in the [index]"
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
10 37.0 F 3 123 Influenza 0 NO PROC 90335341.0 NaN $4566.15
13 37.0 M 2 155 Other gastrointestinal disorders 0 NO PROC 90335341.0 NaN $3125.75
15 37.0 F 6 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $9734.05
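A sketch of why .loc[[0,3,5]] fails on data[10:20], using a hypothetical 20-row frame. The slice keeps the original row labels 10 to 19, so labels 0, 3, and 5 no longer exist, while positions 0, 3, and 5 still do:

```python
import pandas as pd

# Hypothetical frame with default row labels 0..19.
data = pd.DataFrame({'Gender': ['F', 'M'] * 10, 'Code': range(100, 120)})

first_four = data.loc[:3, 'Gender']             # label slices include the end label
picked = data.loc[[0, 3, 5], ['Code', 'Gender']]

subset = data[10:20]                            # rows keep labels 10..19

labels_missing = False
try:
    subset.loc[[0, 3, 5]]                       # no rows carry these labels
except KeyError:
    labels_missing = True

by_position = subset.iloc[[0, 3, 5]]            # positions within the subset

print(list(by_position.index))
```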

Boolean Indexing

Just like numpy, we can index by a boolean array or an array of indices.

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
2 37.0 M 1 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $1609.40
6 37.0 M 5 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $6148.10
7 37.0 M 3 123 Influenza 0 NO PROC 90335341.0 NaN $4204.15
Facility ID CCS Diagnosis Code CCS Procedure Code
0 37.0 197 0
1 37.0 146 0
2 37.0 50 202
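A minimal sketch of both forms on a toy frame (the conditions are illustrative, not necessarily the ones used for the output above):

```python
import pandas as pd

data = pd.DataFrame({
    'Facility ID': [37.0, 37.0, 37.0, 37.0],
    'Gender': ['F', 'F', 'M', 'M'],
    'Length of Stay': [8, 3, 1, 5],
})

is_male = data['Gender'] == 'M'   # boolean Series, one entry per row
males = data[is_male]             # keep rows where the mask is True

# Combine conditions with & and |, with parentheses around each comparison.
long_male = data[(data['Gender'] == 'M') & (data['Length of Stay'] > 1)]

# A list of column labels selects several columns at once.
cols = data[['Facility ID', 'Length of Stay']]

print(males)
```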

Sorting

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
1426998 1439.0 M 1 108 Congestive heart failure; nonhypertensive 0 NO PROC 251948.0 NaN $0.50
780065 989.0 M 2 657 Mood disorders 218 PSYCHO/PSYCHI EVAL/THER 199003.0 154247.0 $1.00
781531 989.0 F 1 661 Substance-related disorders 0 NO PROC 145418.0 NaN $1.00
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
77722 213.0 F 10 100 Acute myocardial infarction 44 COR ARTERY BYP GRF-CABG 224769.0 224769.0 $99999.65
451015 635.0 F 1 115 Aortic; peripheral; and visceral artery aneurysms 52 AORTIC RESECTION; REPL 232988.0 232988.0 $99999.46
1579047 1456.0 M 12 2 Septicemia (except in labor) 157 AMPUTATE LOWER EXTRMITY 258717.0 265448.0 $99998.45

String Methods

Standard string functions can be applied to all cells of a column. These return the changed values; they do not mutate the column in place.

The above overwrites the previous Total Charges column to be a floating point number instead of a string with a dollar sign.
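A sketch of one way to do that conversion, assuming the .str accessor with lstrip followed by astype(float). The original cell may have used a different string method:

```python
import pandas as pd

data = pd.DataFrame({'Total Charges': ['$9546.85', '$11462.75', '$1609.40']})

# .str applies the string method to every cell and returns a new Series.
stripped = data['Total Charges'].str.lstrip('$')
still_text = data['Total Charges'].iloc[0]   # the original column is unchanged

# Assigning back overwrites the column with floating point numbers.
data['Total Charges'] = stripped.astype(float)

print(data.dtypes)
```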

Correct sorting

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
967254 1169.0 M 120 + 63 Diseases of white blood cells 64 BONE MARROW TRANSPLANT 198304.0 229870.0 8593455.88
1560003 1456.0 F 120 + 143 Abdominal hernia 86 OTHER HERNIA REPAIR 165181.0 165181.0 6272871.31
957685 1169.0 F 120 + 6 Hepatitis 176 OT ORGAN TRANSPLANTATN 236414.0 183253.0 5745201.42
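A small sketch of why the conversion matters for sorting, on three toy values. Strings sort character by character, so "$11462.75" comes before "$9546.85"; floats sort by magnitude:

```python
import pandas as pd

data = pd.DataFrame({'Total Charges': ['$9546.85', '$11462.75', '$1609.40']})

# Sorting the raw strings compares them character by character.
text_order = data.sort_values(by='Total Charges')['Total Charges'].tolist()

# After conversion to float, sorting is by numeric value.
data['Total Charges'] = data['Total Charges'].str.lstrip('$').astype(float)
numeric_order = data.sort_values(by='Total Charges', ascending=False)['Total Charges'].tolist()

print(text_order)
print(numeric_order)
```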

Creating New Columns

Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges Charge per day
0 37.0 F 8.0 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN 9546.85 1193.356250
1 37.0 F 3.0 146 Diverticulosis and diverticulitis 0 NO PROC 90335341.0 NaN 11462.75 3820.916667
2 37.0 M 1.0 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 1609.40 1609.400000
3 37.0 F 1.0 154 Noninfectious gastroenteritis 202 ELECTROCARDIOGRAM 90335341.0 167816.0 2638.75 2638.750000
4 37.0 F 3.0 124 Acute and chronic tonsillitis 0 NO PROC 90335341.0 NaN 3538.25 1179.416667
... ... ... ... ... ... ... ... ... ... ... ...
2367278 943.0 F 1.0 245 Syncope 0 NO PROC 156102.0 NaN 10074.00 10074.000000
2367279 943.0 M 1.0 149 Biliary tract disease 0 NO PROC 267443.0 NaN 21252.00 21252.000000
2367280 943.0 F 1.0 102 Nonspecific chest pain 0 NO PROC 267443.0 NaN 11673.00 11673.000000
2367281 943.0 F 1.0 660 Alcohol-related disorders 171 SUTURE SKIN/SUBCUT TISS 267443.0 249597.0 16722.00 16722.000000
2367282 943.0 M 1.0 2 Septicemia (except in labor) 0 NO PROC 251080.0 NaN 11637.00 11637.000000

2367283 rows × 11 columns

nan
'120 +'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "120 +"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-81-260213ec4baa> in <module>
----> 1 pd.to_numeric('120 +')

~\anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    147         coerce_numeric = errors not in ("ignore", "raise")
    148         try:
--> 149             values = lib.maybe_convert_numeric(
    150                 values, set(), coerce_numeric=coerce_numeric
    151             )

pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "120 +" at position 0
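A sketch of building the new column despite values like '120 +', assuming pd.to_numeric with errors='coerce' is an acceptable way to handle them (it replaces unparseable strings with NaN rather than raising):

```python
import pandas as pd

data = pd.DataFrame({
    'Length of Stay': ['8', '3', '120 +'],
    'Total Charges': [9546.85, 11462.75, 8593455.88],
})

# '120 +' cannot be parsed as a number; coerce turns it into NaN.
data['Length of Stay'] = pd.to_numeric(data['Length of Stay'], errors='coerce')

# Arithmetic on columns is element-wise; assigning to a new name adds a column.
data['Charge per day'] = data['Total Charges'] / data['Length of Stay']

print(data)
```

Rows with a missing length of stay end up with a missing charge per day as well.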

Group by

Group records that have the same value for a column

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FFD926BBB0>

We can then apply an aggregating function to these groups.

Length of Stay CCS Diagnosis Code CCS Procedure Code Attending Provider License Number Operating Provider License Number Total Charges Charge per day
Facility ID
377.0 2.670732 121.170732 202.439024 4.664694e+07 4.826434e+07 4060.734024 1614.759785
111.0 3.611111 159.777778 0.000000 1.929986e+05 NaN 4492.587778 1473.003327
37.0 3.162791 123.906977 54.779070 8.298531e+07 7.666251e+06 4939.879651 1628.579552
9250.0 2.203456 205.840173 137.955076 1.082429e+07 1.264187e+07 5618.928860 2497.869519
165.0 4.597884 180.206349 181.920635 2.051072e+05 4.744077e+07 6376.906455 1580.321499
... ... ... ... ... ... ... ...
563.0 6.279619 129.367643 78.085210 2.363745e+05 6.581608e+05 89385.986450 20937.033431
1446.0 5.633545 214.859847 157.190406 3.497396e+05 3.955800e+05 124204.521077 36290.348526
1139.0 8.937250 205.031446 87.044577 3.341266e+05 3.847957e+05 131992.454991 17629.688410
1138.0 38.527005 215.072013 213.108020 1.963923e+05 1.966056e+05 175997.805859 4195.405483
1486.0 61.821429 134.039683 194.801587 1.863599e+05 1.921974e+05 195780.272579 2486.885826

215 rows × 7 columns
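A minimal sketch of grouping and aggregating, on a hypothetical four-row frame with only numeric columns:

```python
import pandas as pd

data = pd.DataFrame({
    'Facility ID': [37, 37, 943, 943],
    'Length of Stay': [1, 3, 1, 1],
    'Total Charges': [100.0, 300.0, 50.0, 150.0],
})

groups = data.groupby('Facility ID')   # lazy grouping object, nothing computed yet
means = groups.mean()                  # one row per facility, mean of each column
by_charge = means.sort_values(by='Total Charges')

print(by_charge)
```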

Example

Total Charges
Gender
U 9858.198939
F 36867.585805
M 44419.344760

The group-by column has become the index. Use reset_index to convert the index back into a column.

Gender Total Charges
0 U 9858.198939
1 F 36867.585805
2 M 44419.344760
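A sketch of the effect of reset_index on a toy frame (values are hypothetical):

```python
import pandas as pd

data = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Total Charges': [100.0, 200.0, 300.0, 400.0],
})

grouped = data.groupby('Gender')[['Total Charges']].mean()  # Gender is the index
flat = grouped.reset_index()                                # Gender is a column again

print(grouped)
print(flat)
```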

Example

CCS Procedure Description  Gender
ABDOMINAL PARACENTESIS     F         56996.536195
                           M         58532.559231
ABORTION (TERM OF PREG)    F         26512.622947
ALCO/DRUG REHAB/DETOX      F         18954.578164
                           M         18187.073614
                                         ...     
UPPER GI X-RAY             M         31221.460246
URETERAL CATHETERIZATN     F         36143.329481
                           M         34182.121658
VARI VEIN STRIP;LOW LMB    F         42519.938571
                           M         65763.081875
Name: Total Charges, Length: 446, dtype: float64

unstack: Pivot a level of the (necessarily hierarchical) index labels.

Gender F M U
CCS Procedure Description
ABDOMINAL PARACENTESIS 56996.536195 58532.559231 NaN
ABORTION (TERM OF PREG) 26512.622947 NaN NaN
ALCO/DRUG REHAB/DETOX 18954.578164 18187.073614 NaN
AMPUTATE LOWER EXTRMITY 100926.639344 92821.328596 NaN
AORTIC RESECTION; REPL 144019.629159 114210.484050 NaN
... ... ... ...
UNGROUPABLE 68101.627500 105811.388636 NaN
UP GASTRO ENDOSC/BIOPSY 42625.453678 43340.006917 NaN
UPPER GI X-RAY 28339.654074 31221.460246 NaN
URETERAL CATHETERIZATN 36143.329481 34182.121658 NaN
VARI VEIN STRIP;LOW LMB 42519.938571 65763.081875 NaN

233 rows × 3 columns
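A sketch of grouping by two columns and unstacking, followed by one way to compare the resulting columns. The toy data is hypothetical, and the F minus M difference is an assumption about what the following outputs compute:

```python
import pandas as pd

data = pd.DataFrame({
    'CCS Procedure Description': ['ELECTROCARDIOGRAM'] * 3 + ['APPENDECTOMY'],
    'Gender': ['F', 'M', 'M', 'F'],
    'Total Charges': [100.0, 200.0, 400.0, 50.0],
})

# Grouping by two columns gives a Series with a two-level index.
means = data.groupby(['CCS Procedure Description', 'Gender'])['Total Charges'].mean()

# unstack moves the innermost level (Gender) into the columns.
table = means.unstack()

# Combinations that never occur become NaN; dropna removes them.
diff = (table['F'] - table['M']).dropna().sort_values()

print(table)
print(diff)
```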

CCS Procedure Description
ABDOMINAL PARACENTESIS     -1536.023036
ABORTION (TERM OF PREG)             NaN
ALCO/DRUG REHAB/DETOX        767.504550
AMPUTATE LOWER EXTRMITY     8105.310748
AORTIC RESECTION; REPL     29809.145110
                               ...     
UNGROUPABLE               -37709.761136
UP GASTRO ENDOSC/BIOPSY     -714.553239
UPPER GI X-RAY             -2881.806172
URETERAL CATHETERIZATN      1961.207823
VARI VEIN STRIP;LOW LMB   -23243.143304
Length: 233, dtype: float64
CCS Procedure Description
ABDOMINAL PARACENTESIS     -1536.023036
ALCO/DRUG REHAB/DETOX        767.504550
AMPUTATE LOWER EXTRMITY     8105.310748
AORTIC RESECTION; REPL     29809.145110
APPENDECTOMY                 209.527573
                               ...     
UNGROUPABLE               -37709.761136
UP GASTRO ENDOSC/BIOPSY     -714.553239
UPPER GI X-RAY             -2881.806172
URETERAL CATHETERIZATN      1961.207823
VARI VEIN STRIP;LOW LMB   -23243.143304
Length: 202, dtype: float64
CCS Procedure Description
EXTRA CIRC AUX OPEN HRT   -69453.636115
DX PRCS ON EYE            -63462.345439
SWAN-GANZ CATH MONITOR    -46738.846837
PRCS ON SPLEEN            -46219.301962
CORNEAL TRANSPLANT        -45028.501148
dtype: float64
CCS Procedure Description
LENS & CATARACT PRCS       35378.240602
OT INTRAOCULAR THER PRC    35907.146236
BONE MARROW TRANSPLANT     37066.859998
DES LES RETINA/CHOROID     52461.621223
CORONARY THROMBOLYSIS      87870.550000
dtype: float64

Combining DataFrames

pd.concat concatenates rows (i.e., default axis=0) while merging columns with the same name.

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
2  NaN  B2  NaN  D2   F2
3  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
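A reduced sketch of pd.concat with two small frames (the names df1 and df2 are assumptions). Columns missing from one frame are filled with NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'B': ['B2', 'B3'], 'F': ['F2', 'F3']}, index=[2, 3])

# Rows are stacked; columns are matched by name.
combined = pd.concat([df1, df2])

print(combined)
```

Row labels are carried over as they are, so overlapping labels appear twice in the result; passing ignore_index=True renumbers the rows instead.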

Join

You can join two tables on a specific column (or columns). Rows that have the same value (or key) in that column will be combined.

  • inner join - key must exist in both tables
  • outer join - key can exist in either table
  • left join - key must exist in left table
  • right join - key must exist in right table

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3

  key   C   D
0  K1  C1  D1
1  K2  C2  D2
2  K4  C4  D4

Inner Join

key A B C D
0 K1 A1 B1 C1 D1
1 K2 A2 B2 C2 D2

Outer Join

key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 NaN NaN
4 K4 NaN NaN C4 D4

Left Join

key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 NaN NaN
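The joins above can be sketched with pd.merge and its how argument, using the two tables shown earlier (the names left and right are assumptions):

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
})
right = pd.DataFrame({
    'key': ['K1', 'K2', 'K4'],
    'C': ['C1', 'C2', 'C4'],
    'D': ['D1', 'D2', 'D4'],
})

inner = pd.merge(left, right, on='key', how='inner')      # keys in both tables
outer = pd.merge(left, right, on='key', how='outer')      # keys in either table
left_join = pd.merge(left, right, on='key', how='left')   # every key of the left table

print(inner)
print(outer)
print(left_join)
```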

Questions

Download: https://mscbio2025.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv

  • How many data records are there?
  • How many coronary bypasses are there?
  • What is the average cost? Standard deviation?
  • What is the most common diagnosis that leads to a coronary bypass?
  • What percent of people with that diagnosis get a coronary bypass?
  • What are the facilities whose average cost for this operation is in the top 10%? Bottom 10%?
  • How correlated is the length of stay to the cost?
  • Is the percentage of people who go to these facilities with the most common diagnosis and receive a coronary bypass significantly different between these two groups?

  • What about knee replacements?

  • How well can a decision tree predict the cost of the operation? What are the most important features?