Pandas How to Choose Columns to Read
In [1]: import pandas as pd
-
Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
- PassengerId: Id of every passenger.
- Survived: This feature takes the values 0 and 1: 0 for not survived and 1 for survived.
- Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
- Name: Name of the passenger.
- Sex: Gender of the passenger.
- Age: Age of the passenger.
- SibSp: Indication that the passenger has siblings and a spouse.
- Parch: Whether a passenger is alone or has family.
- Ticket: Ticket number of the passenger.
- Fare: Indicating the fare.
- Cabin: The cabin of the passenger.
- Embarked: The embarked category.
In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500   NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833   C85         C
2            3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250   NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000  C123         S
4            5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500   NaN         S

[5 rows x 12 columns]
-
How do I select a subset of a DataFrame?
How do I select specific columns from a DataFrame?
- I'm interested in the age of the Titanic passengers.

In [4]: ages = titanic["Age"]

In [5]: ages.head()
Out[5]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
To select a single column, use square brackets [] with the name of the column of interest.
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series. We can verify this by checking the type of the output:
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
And have a look at the shape of the output:
In [7]: titanic["Age"].shape
Out[7]: (891,)
DataFrame.shape is an attribute (remember from the tutorial on reading and writing: do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned.
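The difference between the two shapes can be checked without the CSV file. The following is a minimal sketch on a small hand-made frame; the name sample and all of its values are made up for illustration and only stand in for the Titanic table.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic table (illustrative values only).
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
    }
)

ages = sample["Age"]            # a single column label returns a Series
print(type(ages).__name__)      # Series
print(ages.shape)               # (3,)   one dimension: rows only
print(sample.shape)             # (3, 3) rows and columns
```

Note that shape is read without parentheses in both cases, because it is an attribute and not a method.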
- I'm interested in the age and sex of the Titanic passengers.

In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]:
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male
To select multiple columns, use a list of column names within the selection brackets [].
Note
The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.
The returned data type is a pandas DataFrame:

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.
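One consequence of the list rule is that a list holding a single column name still returns a DataFrame, not a Series. A minimal sketch, again on a made-up frame called sample (a hypothetical stand-in, not the Titanic data):

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative only.
sample = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]}
)

as_series = sample["Age"]     # column label -> Series
as_frame = sample[["Age"]]    # list with one label -> DataFrame with one column

print(as_series.shape)        # (3,)
print(as_frame.shape)         # (3, 1)
```

The double brackets are useful when later code expects a 2-dimensional object even for one column.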
How do I filter specific rows from a DataFrame?
- I'm interested in the passengers older than 35 years.

In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]:
    PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch    Ticket     Fare Cabin  Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0  PC 17599  71.2833   C85         C
6             7         0       1                            McCarthy, Mr. Timothy J    male  ...      0     17463  51.8625   E46         S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  ...      0    113783  26.5500  C103         S
13           14         0       3                        Andersson, Mr. Anders Johan    male  ...      5    347082  31.2750   NaN         S
15           16         1       2                   Hewlett, Mrs. (Mary D Kingcome)   female  ...      0    248706  16.0000   NaN         S

[5 rows x 12 columns]
To select rows based on a conditional expression, use a condition inside the selection brackets [].
The condition inside the selection brackets, titanic["Age"] > 35, checks for which rows the Age column has a value larger than 35:
In [14]: titanic["Age"] > 35
Out[14]:
0      False
1       True
2      False
3      False
4      False
       ...
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
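To make the two steps explicit (build the boolean Series, then use it as a filter), here is a minimal sketch on a made-up frame. The names sample and older are hypothetical, and one age is deliberately missing to show that a missing value compares as False.

```python
import pandas as pd

# Hypothetical stand-in data; the third age is missing on purpose.
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, 38.0, None, 54.0],
    }
)

older = sample["Age"] > 35        # boolean Series, one value per row
print(older.tolist())             # [False, True, False, True]
print(int(older.sum()))           # 2  (True counts as 1)

above_35 = sample[older]          # keep only the rows where the mask is True
print(above_35.index.tolist())    # [1, 3]
```

Summing the mask is a quick way to count matching rows before filtering.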
We know from before that the original Titanic DataFrame consists of 891 rows. Let's have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:
In [15]: above_35.shape
Out[15]: (217, 12)
- I'm interested in the Titanic passengers from cabin class 2 and 3.

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]:
   PassengerId  Survived  Pclass                            Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3         Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
2            3         1       3          Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
4            5         0       3        Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6         0       3                Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
7            8         0       3  Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets, titanic["Pclass"].isin([2, 3]), checks for which rows the Pclass column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

In [19]: class_23.head()
Out[19]:
   PassengerId  Survived  Pclass                            Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3         Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
2            3         1       3          Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
4            5         0       3        Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6         0       3                Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
7            8         0       3  Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
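The claimed equivalence between isin() and the | combination can be checked directly. A minimal sketch on a made-up Pclass column (the frame sample is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in data with a mix of the three classes.
sample = pd.DataFrame({"Pclass": [3, 1, 3, 2, 1]})

with_isin = sample[sample["Pclass"].isin([2, 3])]
with_or = sample[(sample["Pclass"] == 2) | (sample["Pclass"] == 3)]

print(with_isin.index.tolist())    # [0, 2, 3]
print(with_isin.equals(with_or))   # True
```

With only two values either form reads fine; isin() tends to stay readable as the list of accepted values grows.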
Note
When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you cannot use or/and but need to use the or operator | and the and operator &.
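The parentheses matter because & and | bind more tightly than comparison operators in Python. The sketch below uses a made-up frame called sample (a hypothetical stand-in) to combine two conditions with &, and to show that the Python keyword and raises an error on a Series.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative only.
sample = pd.DataFrame(
    {"Age": [22.0, 38.0, 54.0, 40.0], "Pclass": [3, 1, 1, 3]}
)

# Element-wise "and": each condition sits in its own parentheses.
older_first_class = sample[(sample["Age"] > 35) & (sample["Pclass"] == 1)]
print(older_first_class.index.tolist())   # [1, 2]

# The keyword "and" asks for one truth value for a whole Series,
# which pandas refuses with a ValueError.
try:
    sample[(sample["Age"] > 35) and (sample["Pclass"] == 1)]
    raised = False
except ValueError:
    raised = True
print(raised)                             # True
```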
- I want to work with passenger data for which the age is known.

In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]:
   PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500   NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833   C85         C
2            3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250   NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000  C123         S
4            5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500   NaN         S

[5 rows x 12 columns]
The notna() conditional function returns True for each row whose value is not a Null value. As such, this can be combined with the selection brackets [] to filter the data table.
You might wonder what actually changed, as the first 5 lines still show the same values. One way to verify is to check if the shape has changed:
In [22]: age_no_na.shape
Out[22]: (714, 12)
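The same check can be reproduced on a small made-up frame where the missing values are easy to see. The sketch below (with the hypothetical stand-in sample) also compares the result with dropna(subset=...), a dedicated pandas method that is not used in this tutorial but gives the same rows here.

```python
import pandas as pd

# Hypothetical stand-in data; two of the four ages are missing.
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, None, 26.0, None],
    }
)

known_age = sample[sample["Age"].notna()]
print(sample.shape)                        # (4, 2)
print(known_age.shape)                     # (2, 2)
print(int(sample["Age"].isna().sum()))     # 2  rows dropped by the filter

# dropna restricted to the Age column selects the same rows.
print(known_age.equals(sample.dropna(subset=["Age"])))   # True
```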
For more dedicated functions on missing values, see the user guide section about handling missing data.
How do I select specific rows and columns from a DataFrame?
- I'm interested in the names of the passengers older than 35 years.

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]:
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
- I'm interested in rows 10 till 25 and columns 3 to 5.

In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]
Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].
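Positions with iloc start at 0 and the end of a slice is excluded, which is why rows 10 till 25 and columns 3 to 5 are written as 9:25 and 2:5. A minimal sketch on a made-up frame (the name sample and the column names col1 to col6 are hypothetical):

```python
import pandas as pd

# Hypothetical 30-row, 6-column frame; names and values are made up.
sample = pd.DataFrame({f"col{i}": range(30) for i in range(1, 7)})

block = sample.iloc[9:25, 2:5]            # positions 9..24 and 2..4

print(block.shape)                        # (16, 3)
print(block.columns.tolist())             # ['col3', 'col4', 'col5']
print(block.index[0], block.index[-1])    # 9 24
```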
When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the fourth column (position 3, the Name column):
In [26]: titanic.iloc[0:3, 3] = "anonymous"

In [27]: titanic.head()
Out[27]:
   PassengerId  Survived  Pclass                                          Name  ...            Ticket     Fare Cabin  Embarked
0            1         0       3                                     anonymous  ...         A/5 21171   7.2500   NaN         S
1            2         1       1                                     anonymous  ...          PC 17599  71.2833   C85         C
2            3         1       3                                     anonymous  ...  STON/O2. 3101282   7.9250   NaN         S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000  C123         S
4            5         0       3                      Allen, Mr. William Henry  ...            373450   8.0500   NaN         S

[5 rows x 12 columns]
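Assignment works the same way with loc and a condition. The sketch below is an illustration on a made-up frame; sample and the replacement value senior are hypothetical and not part of the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative only.
sample = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, 54.0, 40.0]}
)

# By position: first two rows of the first column.
sample.iloc[0:2, 0] = "anonymous"

# By label and condition: the Name of every row with Age above 50.
sample.loc[sample["Age"] > 50, "Name"] = "senior"

print(sample["Name"].tolist())   # ['anonymous', 'anonymous', 'senior', 'D']
```

Doing the row and column selection in a single loc or iloc call, as here, means the assignment goes to the original frame and not to an intermediate copy.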
REMEMBER
- When selecting subsets of data, square brackets [] are used.
- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using loc when using the row and column names.
- Select specific rows and/or columns using iloc when using the positions in the table.
- You can assign new values to a selection based on loc/iloc.
Source: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html