7.
Advanced Filter and Querying

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
August 2021

7.1. Covered in this Chapter

  1. How and When to use the Filter() Function

  2. How and When to use the Query() Function

7.2. Introduction

In this chapter, we will meet two advanced ways of filtering or searching (querying) our data. These are aptly named Filter() and Query() functions. These two functions allow us to do some fairly advanced things within a narrow scope. By narrow scope, I mean the questions that we want to pose. Whenever you are manipulating or probing data, it is always best to think about the task as simply asking a question. In essence, this is precisely what you are doing. You are asking the database a question. In order to ask the question correctly, as is the case with any language, you need to know the correct syntax and when that particular question is the right one to ask. In this chapter, we explore how to frame specific questions with Filter() and Query().

Each function is used in particular circumstances. Filter() is useful for getting a large data down to a smaller size, based on the questions you want to ask. Query(), on the other hand, is useful for phrasing questions that use comparison operators (less than, equal to, greater than, etc.). Let’s explore each in turn, but first, let’s import our Titanic dataset.

import pandas as pd
df = pd.read_csv("data/titanic.csv")
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

7.3. The Filter Function

The filter function is a great way to grab only the relevant columns. The syntax of filter is a bit easier to use. It takes a single argument, a list of strings. These strings correspond to the columns.

When to use Filter()
Use filter() when you want to get a quick sense of your dataset or, as we shall see, create a new dataframe based on the columns you want. It is particularly useful if your dataset has many columns. You can also use it to reorder your columns in a more desired way.

Let’s say that we are interested in just studying the names of the passengers of the Titanic. It does not make sense to work with the entire DataFrame. We can, therefore, use filter to just grab the names column, like so.

df.filter(["Name"])
Name
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
... ...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

891 rows × 1 columns

This is great, but what if I also want to see the ages of these passengers. No problem. I can add an additional column to the list.

df.filter(["Name", "Age"])
Name Age
0 Braund, Mr. Owen Harris 22.0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0
2 Heikkinen, Miss. Laina 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
4 Allen, Mr. William Henry 35.0
... ... ...
886 Montvila, Rev. Juozas 27.0
887 Graham, Miss. Margaret Edith 19.0
888 Johnston, Miss. Catherine Helen "Carrie" NaN
889 Behr, Mr. Karl Howell 26.0
890 Dooley, Mr. Patrick 32.0

891 rows × 2 columns

What if I want age to come before name? I can rearrange the order.

df.filter(["Age", "Name"])
Age Name
0 22.0 Braund, Mr. Owen Harris
1 38.0 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 26.0 Heikkinen, Miss. Laina
3 35.0 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 35.0 Allen, Mr. William Henry
... ... ...
886 27.0 Montvila, Rev. Juozas
887 19.0 Graham, Miss. Margaret Edith
888 NaN Johnston, Miss. Catherine Helen "Carrie"
889 26.0 Behr, Mr. Karl Howell
890 32.0 Dooley, Mr. Patrick

891 rows × 2 columns

Note that we are not bound to the order of the DataFrame. This is particularly useful if we want to make a new DataFrame.

new_df = df.filter(["Age", "Name"])
new_df
Age Name
0 22.0 Braund, Mr. Owen Harris
1 38.0 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 26.0 Heikkinen, Miss. Laina
3 35.0 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 35.0 Allen, Mr. William Henry
... ... ...
886 27.0 Montvila, Rev. Juozas
887 19.0 Graham, Miss. Margaret Edith
888 NaN Johnston, Miss. Catherine Helen "Carrie"
889 26.0 Behr, Mr. Karl Howell
890 32.0 Dooley, Mr. Patrick

891 rows × 2 columns

Now, we can only examine the data that we actually need. This will make our code faster and require examining fewer data for each row. Filter’s big limitation is in the fact that it cannot filter the data further. We cannot, for example, add an extra argument to filter() that would only return the Names with “Miss.”, but we can tack on additional arguments to filter, such as those that we saw in the previous chapter on searching strings.

Let’s try that now.

df.filter(["Age", "Name"]).Name.str.contains("Miss.")
0      False
1      False
2       True
3      False
4      False
       ...  
886    False
887     True
888     True
889    False
890    False
Name: Name, Length: 891, dtype: bool

Note that we now have a list of True False statements. These tell us if the word “Miss.” is in the column Name. If it is there, we see a True. If it is not, we see a False. Let’s say we want to know how many passengers have the title “Miss.”, we stack .value_counts() into the chain. Note the plural of counts.

df.filter(["Age", "Name"]).Name.str.contains("Miss.").value_counts()
False    709
True     182
Name: Name, dtype: int64

Now, let’s say we were interested in ONLY seeing the rows that contain “Miss.” in them. We need to structure that filtering as a list. But note that if we wrap the whole thing in a list, we don’t filter out just the Age and Name columns. Instead, we get the entire DataFrame.

df[df.filter(["Age", "Name"]).Name.str.contains("Miss.") == True]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

182 rows × 12 columns

This is because of where our filter occurs in the chain of commands. Note that filter occurs within the brackets where we are setting up our parameters. This means that we are filtering under the conditions of how the list is created, but that once a row is processed, the DataFrame is unfiltered. If we want our filter to work, we need to place it after the conditions have been sorted. Notice that the filter is outside of our brackets now.

df[df.Name.str.contains("Miss.") == True].filter(["Age", "Name"])
Age Name
2 26.0 Heikkinen, Miss. Laina
10 4.0 Sandstrom, Miss. Marguerite Rut
11 58.0 Bonnell, Miss. Elizabeth
14 14.0 Vestrom, Miss. Hulda Amanda Adolfina
22 15.0 McGowan, Miss. Anna "Annie"
... ... ...
866 27.0 Duran y More, Miss. Asuncion
875 15.0 Najib, Miss. Adele Kiamie "Jane"
882 22.0 Dahlberg, Miss. Gerda Ulrika
887 19.0 Graham, Miss. Margaret Edith
888 NaN Johnston, Miss. Catherine Helen "Carrie"

182 rows × 2 columns

7.4. The Query Function

The Pandas Query() method is a fantastic way to filter and query data. Unlike other Pandas methods, it uses a string argument that functions rather similar to SQL syntax.

When to use Query
You should only use Query() when your question (query) can be posed as greater than, less than, equal to, or not equal to (or some combination of these). Let me demonstrate. If we wanted to filter out all the rows where the Pclass was equal to 3, we could use the following string.

df.query("Pclass == 3")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

491 rows × 12 columns

I can even stack questions together within this string. Let’s say, I am interested in all who were in Pclass 3 and survived. I could write the following string argument.

df.query("Pclass == 3 & Survived == 1")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
838 839 1 3 Chip, Mr. Chang male 32.0 0 0 1601 56.4958 NaN S
855 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 392091 9.3500 NaN S
858 859 1 3 Baclini, Mrs. Solomon (Latifa Qurban) female 24.0 0 3 2666 19.2583 NaN C
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C

119 rows × 12 columns

Let’s make the question even more complex. I want to now find the number of these individuals who were over the age of 40.

df.query("Pclass == 3 & Survived == 1 & Age > 40")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
338 339 1 3 Dahl, Mr. Karl Edwart male 45.0 0 0 7598 8.0500 NaN S
414 415 1 3 Sundman, Mr. Johan Julian male 44.0 0 0 STON/O 2. 3101269 7.9250 NaN S
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S

I now have a list of 3 individuals who met all criteria. I can also use my Or operator (|), rather than &. Let’s see if we can achieve what we want. (Note this is an intentional mistake. Look below for why).

df.query("Pclass == 3 & Survived == 1 & Age > 40 | Age < 10")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S
16 17 0 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 NaN Q
24 25 0 3 Palsson, Miss. Torborg Danira female 8.00 3 1 349909 21.0750 NaN S
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 SC/Paris 2123 41.5792 NaN C
... ... ... ... ... ... ... ... ... ... ... ... ...
827 828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
831 832 1 2 Richards, Master. George Sibley male 0.83 1 1 29106 18.7500 NaN S
850 851 0 3 Andersson, Master. Sigvard Harald Elias male 4.00 4 2 347082 31.2750 NaN S
852 853 0 3 Boulos, Miss. Nourelain female 9.00 1 1 2678 15.2458 NaN C
869 870 1 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S

65 rows × 12 columns

Woops! Something has gone seriously wrong here. We have all different kinds of Pclasses, not just 3s. We have people who survived and did not. And, most problematically, we have people only under the age of 10. What has gone wrong here!? The answer lies in a perhaps forgotten part of math from when we were children, the order of operations. If you recall from those lessons, the order of operations determines the way in which you process the problem. 4 + 7 x 2 is very different from (4+7) x 2. The former is 18 and the latter is 22 because the latter has parentheses which tell the reader to do that operation first. Because programming sits on top of mathematics (especially Boolean algebra), the syntax of mathematics is often embedded in programming.

Let’s use the order of operations correctly and rephrase our query. Note the parentheses now before Age > 40 and after Age < 10. Note also that the & is before the parentheses.

df.query("Pclass == 3 & Survived == 1 & (Age > 40 | Age < 10)")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S
165 166 1 3 Goldsmith, Master. Frank John William "Frankie" male 9.00 0 2 363291 20.5250 NaN S
172 173 1 3 Johnson, Miss. Eleanor Ileen female 1.00 1 1 347742 11.1333 NaN S
184 185 1 3 Kink-Heilmann, Miss. Luise Gretchen female 4.00 0 2 315153 22.0250 NaN S
233 234 1 3 Asplund, Miss. Lillian Gertrud female 5.00 4 2 347077 31.3875 NaN S
261 262 1 3 Asplund, Master. Edvin Rojj Felix male 3.00 4 2 347077 31.3875 NaN S
338 339 1 3 Dahl, Mr. Karl Edwart male 45.00 0 0 7598 8.0500 NaN S
348 349 1 3 Coutts, Master. William Loch "William" male 3.00 1 1 C.A. 37671 15.9000 NaN S
381 382 1 3 Nakid, Miss. Maria ("Mary") female 1.00 0 2 2653 15.7417 NaN C
414 415 1 3 Sundman, Mr. Johan Julian male 44.00 0 0 STON/O 2. 3101269 7.9250 NaN S
448 449 1 3 Baclini, Miss. Marie Catherine female 5.00 2 1 2666 19.2583 NaN C
469 470 1 3 Baclini, Miss. Helene Barbara female 0.75 2 1 2666 19.2583 NaN C
479 480 1 3 Hirvonen, Miss. Hildur E female 2.00 0 1 3101298 12.2875 NaN S
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.00 0 0 4134 9.5875 NaN S
489 490 1 3 Coutts, Master. Eden Leslie "Neville" male 9.00 1 1 C.A. 37671 15.9000 NaN S
644 645 1 3 Baclini, Miss. Eugenie female 0.75 2 1 2666 19.2583 NaN C
691 692 1 3 Karun, Miss. Manca female 4.00 0 1 349256 13.4167 NaN C
751 752 1 3 Moor, Master. Meier male 6.00 0 1 392096 12.4750 E121 S
777 778 1 3 Emanuel, Miss. Virginia Ethel female 5.00 0 0 364516 12.4750 NaN S
788 789 1 3 Dean, Master. Bertram Vere male 1.00 1 2 C.A. 2315 20.5750 NaN S
803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
869 870 1 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S

This is why query is such a powerful function in Pandas. You can do a lot with a single string. There are other ways to achieve this same result, but if your question cna be entirely phrased as a series of comparison operators (equal to, less than, etc.), then Query is likely the best option.