Preface

For setting up the data science environment, see my earlier post:

[Deep Dive into Python] Data Science with Python (1): Environment Setup

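Previous data science posts:

[Deep Dive into Python] Data Science with Python (2): jupyter-lab and NumPy Arrays

[Deep Dive into Python] Data Science with Python (3): NumPy Constants, Functions, and Linear Spaces

[Deep Dive into Python] Data Science with Python (4): Exercises and Solutions (Book p. 337)

[Deep Dive into Python] Data Science with Python (5): Matplotlib Visualization (1)

[Deep Dive into Python] Data Science with Python (6): Matplotlib Visualization (2)

[Deep Dive into Python] Data Science with Python (7): Exercises from Book p. 352

[Deep Dive into Python] Data Science with Python (8): pandas Data Structures: Series and DataFrame

[Deep Dive into Python] Data Science with Python (9): Exercises from Book p. 361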

Note on the code: the snippets were run interactively on a real machine, so some import statements may have been omitted.

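This post uses the pandas library for some initial data processing and analysis. The data is a CSV file of Nobel laureates over the years (laureates.csv, which must be downloaded beforehand). Simple pandas commands are used to match and search the file (975 rows, 20 columns) for the prize records of Richard Feynman and Kip Thorne, up to the first error encountered (a ValueError). The CSV file can be downloaded with the following command: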

curl -OL https://cdn.learnenough.com/laureates.csv

Python Code Snippet 1

import pandas as pd

nobel = pd.read_csv("laureates.csv")
print("Output for describe() method:")
print(nobel.describe())
print()
print("Output for head() method:")
print(nobel.head())
print()
print("Output for info() method:")
print(nobel.info())  # info() prints its report itself and returns None, which print() then shows as "None"
print()

Output:

Output for describe() method:
                id         year       share
count   975.000000   975.000000  975.000000
mean    496.221538  1972.471795    2.014359
std     290.594353    34.058064    0.943909
min       1.000000  1901.000000    1.000000
25%     244.500000  1948.500000    1.000000
50%     488.000000  1978.000000    2.000000
75%     746.500000  2001.000000    3.000000
max    1009.000000  2021.000000    4.000000

Output for head() method:
   id       firstname    surname        born        died  \
0   1  Wilhelm Conrad    Röntgen  1845-03-27  1923-02-10   
1   2      Hendrik A.    Lorentz  1853-07-18  1928-02-04   
2   3          Pieter     Zeeman  1865-05-25  1943-10-09   
3   4           Henri  Becquerel  1852-12-15  1908-08-25   
4   5          Pierre      Curie  1859-05-15  1906-04-19   

             bornCountry bornCountryCode                bornCity  \
0  Prussia (now Germany)              DE  Lennep (now Remscheid)   
1        the Netherlands              NL                  Arnhem   
2        the Netherlands              NL              Zonnemaire   
3                 France              FR                   Paris   
4                 France              FR                   Paris   

       diedCountry diedCountryCode   diedCity gender  year category  \
0          Germany              DE     Munich   male  1901  physics   
1  the Netherlands              NL        NaN   male  1902  physics   
2  the Netherlands              NL  Amsterdam   male  1902  physics   
3           France              FR        NaN   male  1903  physics   
4           France              FR      Paris   male  1903  physics   

  overallMotivation  share                                         motivation  \
0               NaN      1  "in recognition of the extraordinary services ...   
1               NaN      2  "in recognition of the extraordinary service t...   
2               NaN      2  "in recognition of the extraordinary service t...   
3               NaN      2  "in recognition of the extraordinary services ...   
4               NaN      4  "in recognition of the extraordinary services ...   

                                                name       city  \
0                                  Munich University     Munich   
1                                  Leiden University     Leiden   
2                               Amsterdam University  Amsterdam   
3                                École Polytechnique      Paris   
4  École municipale de physique et de chimie indu...      Paris   

           country  
0          Germany  
1  the Netherlands  
2  the Netherlands  
3           France  
4           France  

Output for info() method:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 975 non-null    int64 
 1   firstname          975 non-null    object
 2   surname            945 non-null    object
 3   born               974 non-null    object
 4   died               975 non-null    object
 5   bornCountry        946 non-null    object
 6   bornCountryCode    946 non-null    object
 7   bornCity           943 non-null    object
 8   diedCountry        640 non-null    object
 9   diedCountryCode    640 non-null    object
 10  diedCity           634 non-null    object
 11  gender             975 non-null    object
 12  year               975 non-null    int64 
 13  category           975 non-null    object
 14  overallMotivation  23 non-null     object
 15  share              975 non-null    int64 
 16  motivation         975 non-null    object
 17  name               717 non-null    object
 18  city               712 non-null    object
 19  country            713 non-null    object
dtypes: int64(3), object(17)
memory usage: 152.5+ KB
None
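
The Non-Null Count column above shows that several columns, surname among them (945 of 975), contain missing values. A minimal sketch of counting them directly, using a small made-up DataFrame (the name `toy` and its rows are illustrative assumptions, not taken from laureates.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the laureates table; the values are made up.
toy = pd.DataFrame(
    {
        "firstname": ["Richard P.", "Kip S.", "Some Organization"],
        "surname": ["Feynman", "Thorne", np.nan],
        "year": [1965, 2017, 1999],
    }
)

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# info() prints its report itself and returns None, so it needs no print().
toy.info()
```

Applied to the real data, `nobel.isna().sum()` should agree with the info() report, for example 30 missing surnames (975 - 945).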

Python Code Snippet 2

print(nobel[nobel["surname"] == "Feynman"])
print(nobel[nobel["surname"] == "Feynman"].year)
print((nobel["surname"] == "Feynman")[86])
print(nobel.loc[nobel["surname"] == "Feynman", "year"])
print(nobel.loc[nobel["firstname"] == "Kip"])
print(nobel.loc[nobel["firstname"] == "Kip S."])
print(nobel.loc[nobel["firstname"] == "Kip S."].year)
print(nobel.loc[nobel["firstname"].str.contains("Kip")])
print(nobel.loc[nobel["surname"].str.contains("Feynman")])

Output (with error at the end):

# Prize record for Richard Feynman (USA, 1918-1988): 1965, Caltech, at row index 86:
    id   firstname  surname        born        died bornCountry  \
86  86  Richard P.  Feynman  1918-05-11  1988-02-15         USA   

   bornCountryCode     bornCity diedCountry diedCountryCode        diedCity  \
86              US  New York NY         USA              US  Los Angeles CA   

   gender  year category overallMotivation  share  \
86   male  1965  physics               NaN      3   

                                           motivation  \
86  "for their fundamental work in quantum electro...   

                                            name         city country  
86  California Institute of Technology (Caltech)  Pasadena CA     USA
# Year of the award: 1965
86    1965
Name: year, dtype: int64
# Value of the boolean mask at index 86:
True
# Only the year is returned:
86    1965
Name: year, dtype: int64

# An exact match on "Kip" returns an empty result:
Empty DataFrame
Columns: [id, firstname, surname, born, died, bornCountry, bornCountryCode, bornCity, diedCountry, diedCountryCode, diedCity, gender, year, category, overallMotivation, share, motivation, name, city, country]
Index: []

# Matching "Kip S." instead returns the prize record for Kip Thorne (born 1940; gravitational-wave observatory):
      id firstname surname        born        died bornCountry  \
916  943    Kip S.  Thorne  1940-06-01  0000-00-00         USA   

    bornCountryCode  bornCity diedCountry diedCountryCode diedCity gender  \
916              US  Logan UT         NaN             NaN      NaN   male   

     year category overallMotivation  share  \
916  2017  physics               NaN      4   

                                            motivation  \
916  "for decisive contributions to the LIGO detect...   

                         name city country  
916  LIGO/VIRGO Collaboration  NaN     NaN

# Year of Kip S. Thorne's award:
916    2017
Name: year, dtype: int64

# Substring search for first names containing "Kip"; returns the row at index 916:
      id firstname surname        born        died bornCountry  \
916  943    Kip S.  Thorne  1940-06-01  0000-00-00         USA   

    bornCountryCode  bornCity diedCountry diedCountryCode diedCity gender  \
916              US  Logan UT         NaN             NaN      NaN   male   

     year category overallMotivation  share  \
916  2017  physics               NaN      4   

                                            motivation  \
916  "for decisive contributions to the LIGO detect...   

                         name city country  
916  LIGO/VIRGO Collaboration  NaN     NaN  

# Error: the surname column contains NaN (missing) entries, so the mask is not purely boolean and a ValueError is raised:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a3d47195d041> in <module>
      7 print(nobel.loc[nobel["firstname"] == "Kip S."].year)
      8 print(nobel.loc[nobel["firstname"].str.contains("Kip")])
----> 9 print(nobel.loc[nobel["surname"].str.contains("Feynman")])

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    877 
    878             maybe_callable = com.apply_if_callable(key, self.obj)
--> 879             return self._getitem_axis(maybe_callable, axis=axis)
    880 
    881     def _is_scalar_access(self, key: Tuple):

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1087             self._validate_key(key, axis)
   1088             return self._get_slice_axis(key, axis=axis)
-> 1089         elif com.is_bool_indexer(key):
   1090             return self._getbool_axis(key, axis=axis)
   1091         elif is_list_like_indexer(key):

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
    132                 na_msg = "Cannot mask with non-boolean array containing NA / NaN values"
    133                 if isna(key).any():
--> 134                     raise ValueError(na_msg)
    135                 return False
    136             return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values
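
In the pandas version shown in the traceback, `str.contains` yields NaN rather than True or False for rows whose surname is missing, and `.loc` refuses such a mask. One common workaround is to pass `na=False` to `str.contains`, so that missing entries count as non-matches. A minimal sketch on a made-up three-row DataFrame (the name `toy` and its rows are illustrative assumptions, not taken from laureates.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the laureates table; one surname is missing.
toy = pd.DataFrame(
    {
        "firstname": ["Richard P.", "Kip S.", "Some Organization"],
        "surname": ["Feynman", "Thorne", np.nan],
        "year": [1965, 2017, 1999],
    }
)

# na=False treats a missing surname as "no match", so the mask is purely boolean.
mask = toy["surname"].str.contains("Feynman", na=False)

# The mask can now be used for row selection without raising ValueError.
feynman = toy.loc[mask]
print(feynman)
```

On the real data the same idea would read `nobel.loc[nobel["surname"].str.contains("Feynman", na=False)]`. The equality test `nobel["surname"] == "Feynman"` used earlier does not hit this problem, because comparing NaN with a string simply gives False.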

Reference

Michael Hartl, Learn Enough Python to Be Dangerous: Software Development, Flask Web Apps, and Beginning Data Science with Python. Boston: Pearson, 2023.
