7. Advanced Filter and Querying ¶
7.1. Covered in this Chapter¶
How and When to use the Filter() Function
How and When to use the Query() Function
7.2. Introduction¶
In this chapter, we will meet two advanced ways of filtering or searching (querying) our data. These are aptly named Filter() and Query() functions. These two functions allow us to do some fairly advanced things within a narrow scope. By narrow scope, I mean the questions that we want to pose. Whenever you are manipulating or probing data, it is always best to think about the task as simply asking a question. In essence, this is precisely what you are doing. You are asking the database a question. In order to ask the question correctly, as is the case with any language, you need to know the correct syntax and when that particular question is the right one to ask. In this chapter, we explore how to frame specific questions with Filter() and Query().
Each function is used in particular circumstances. Filter() is useful for getting a large data down to a smaller size, based on the questions you want to ask. Query(), on the other hand, is useful for phrasing questions that use comparison operators (less than, equal to, greater than, etc.). Let’s explore each in turn, but first, let’s import our Titanic dataset.
import pandas as pd
df = pd.read_csv("data/titanic.csv")
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
7.3. The Filter Function¶
The filter function is a great way to grab only the relevant columns. The syntax of filter is a bit easier to use. It takes a single argument, a list of strings. These strings correspond to the columns.
When to use Filter()
Use filter() when you want to get a quick sense of your dataset or, as we shall see, create a new dataframe based on the columns you want. It is particularly useful if your dataset has many columns. You can also use it to reorder your columns in a more desired way.
Let’s say that we are interested in just studying the names of the passengers of the Titanic. It does not make sense to work with the entire DataFrame. We can, therefore, use filter to just grab the names column, like so.
df.filter(["Name"])
Name | |
---|---|
0 | Braund, Mr. Owen Harris |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | Heikkinen, Miss. Laina |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | Allen, Mr. William Henry |
... | ... |
886 | Montvila, Rev. Juozas |
887 | Graham, Miss. Margaret Edith |
888 | Johnston, Miss. Catherine Helen "Carrie" |
889 | Behr, Mr. Karl Howell |
890 | Dooley, Mr. Patrick |
891 rows × 1 columns
This is great, but what if I also want to see the ages of these passengers. No problem. I can add an additional column to the list.
df.filter(["Name", "Age"])
Name | Age | |
---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 |
2 | Heikkinen, Miss. Laina | 26.0 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 |
4 | Allen, Mr. William Henry | 35.0 |
... | ... | ... |
886 | Montvila, Rev. Juozas | 27.0 |
887 | Graham, Miss. Margaret Edith | 19.0 |
888 | Johnston, Miss. Catherine Helen "Carrie" | NaN |
889 | Behr, Mr. Karl Howell | 26.0 |
890 | Dooley, Mr. Patrick | 32.0 |
891 rows × 2 columns
What if I want age to come before name? I can rearrange the order.
df.filter(["Age", "Name"])
Age | Name | |
---|---|---|
0 | 22.0 | Braund, Mr. Owen Harris |
1 | 38.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 26.0 | Heikkinen, Miss. Laina |
3 | 35.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 35.0 | Allen, Mr. William Henry |
... | ... | ... |
886 | 27.0 | Montvila, Rev. Juozas |
887 | 19.0 | Graham, Miss. Margaret Edith |
888 | NaN | Johnston, Miss. Catherine Helen "Carrie" |
889 | 26.0 | Behr, Mr. Karl Howell |
890 | 32.0 | Dooley, Mr. Patrick |
891 rows × 2 columns
Note that we are not bound to the order of the DataFrame. This is particularly useful if we want to make a new DataFrame.
new_df = df.filter(["Age", "Name"])
new_df
Age | Name | |
---|---|---|
0 | 22.0 | Braund, Mr. Owen Harris |
1 | 38.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 26.0 | Heikkinen, Miss. Laina |
3 | 35.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 35.0 | Allen, Mr. William Henry |
... | ... | ... |
886 | 27.0 | Montvila, Rev. Juozas |
887 | 19.0 | Graham, Miss. Margaret Edith |
888 | NaN | Johnston, Miss. Catherine Helen "Carrie" |
889 | 26.0 | Behr, Mr. Karl Howell |
890 | 32.0 | Dooley, Mr. Patrick |
891 rows × 2 columns
Now, we can only examine the data that we actually need. This will make our code faster and require examining fewer data for each row. Filter’s big limitation is in the fact that it cannot filter the data further. We cannot, for example, add an extra argument to filter() that would only return the Names with “Miss.”, but we can tack on additional arguments to filter, such as those that we saw in the previous chapter on searching strings.
Let’s try that now.
df.filter(["Age", "Name"]).Name.str.contains("Miss.")
0 False
1 False
2 True
3 False
4 False
...
886 False
887 True
888 True
889 False
890 False
Name: Name, Length: 891, dtype: bool
Note that we now have a list of True False statements. These tell us if the word “Miss.” is in the column Name. If it is there, we see a True. If it is not, we see a False. Let’s say we want to know how many passengers have the title “Miss.”, we stack .value_counts() into the chain. Note the plural of counts.
df.filter(["Age", "Name"]).Name.str.contains("Miss.").value_counts()
False 709
True 182
Name: Name, dtype: int64
Now, let’s say we were interested in ONLY seeing the rows that contain “Miss.” in them. We need to structure that filtering as a list. But note that if we wrap the whole thing in a list, we don’t filter out just the Age and Name columns. Instead, we get the entire DataFrame.
df[df.filter(["Age", "Name"]).Name.str.contains("Miss.") == True]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C |
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
182 rows × 12 columns
This is because of where our filter occurs in the chain of commands. Note that filter occurs within the brackets where we are setting up our parameters. This means that we are filtering under the conditions of how the list is created, but that once a row is processed, the DataFrame is unfiltered. If we want our filter to work, we need to place it after the conditions have been sorted. Notice that the filter is outside of our brackets now.
df[df.Name.str.contains("Miss.") == True].filter(["Age", "Name"])
Age | Name | |
---|---|---|
2 | 26.0 | Heikkinen, Miss. Laina |
10 | 4.0 | Sandstrom, Miss. Marguerite Rut |
11 | 58.0 | Bonnell, Miss. Elizabeth |
14 | 14.0 | Vestrom, Miss. Hulda Amanda Adolfina |
22 | 15.0 | McGowan, Miss. Anna "Annie" |
... | ... | ... |
866 | 27.0 | Duran y More, Miss. Asuncion |
875 | 15.0 | Najib, Miss. Adele Kiamie "Jane" |
882 | 22.0 | Dahlberg, Miss. Gerda Ulrika |
887 | 19.0 | Graham, Miss. Margaret Edith |
888 | NaN | Johnston, Miss. Catherine Helen "Carrie" |
182 rows × 2 columns
7.4. The Query Function¶
The Pandas Query() method is a fantastic way to filter and query data. Unlike other Pandas methods, it uses a string argument that functions rather similar to SQL syntax.
When to use Query
You should only use Query() when your question (query) can be posed as greater than, less than, equal to, or not equal to (or some combination of these). Let me demonstrate. If we wanted to filter out all the rows where the Pclass was equal to 3, we could use the following string.
df.query("Pclass == 3")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
491 rows × 12 columns
I can even stack questions together within this string. Let’s say, I am interested in all who were in Pclass 3 and survived. I could write the following string argument.
df.query("Pclass == 3 & Survived == 1")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
838 | 839 | 1 | 3 | Chip, Mr. Chang | male | 32.0 | 0 | 0 | 1601 | 56.4958 | NaN | S |
855 | 856 | 1 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0 | 0 | 1 | 392091 | 9.3500 | NaN | S |
858 | 859 | 1 | 3 | Baclini, Mrs. Solomon (Latifa Qurban) | female | 24.0 | 0 | 3 | 2666 | 19.2583 | NaN | C |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S |
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
119 rows × 12 columns
Let’s make the question even more complex. I want to now find the number of these individuals who were over the age of 40.
df.query("Pclass == 3 & Survived == 1 & Age > 40")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
338 | 339 | 1 | 3 | Dahl, Mr. Karl Edwart | male | 45.0 | 0 | 0 | 7598 | 8.0500 | NaN | S |
414 | 415 | 1 | 3 | Sundman, Mr. Johan Julian | male | 44.0 | 0 | 0 | STON/O 2. 3101269 | 7.9250 | NaN | S |
483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
I now have a list of 3 individuals who met all criteria. I can also use my Or operator (|), rather than &. Let’s see if we can achieve what we want. (Note this is an intentional mistake. Look below for why).
df.query("Pclass == 3 & Survived == 1 & Age > 40 | Age < 10")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.00 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
827 | 828 | 1 | 2 | Mallet, Master. Andre | male | 1.00 | 0 | 2 | S.C./PARIS 2079 | 37.0042 | NaN | C |
831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.7500 | NaN | S |
850 | 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.00 | 4 | 2 | 347082 | 31.2750 | NaN | S |
852 | 853 | 0 | 3 | Boulos, Miss. Nourelain | female | 9.00 | 1 | 1 | 2678 | 15.2458 | NaN | C |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
65 rows × 12 columns
Woops! Something has gone seriously wrong here. We have all different kinds of Pclasses, not just 3s. We have people who survived and did not. And, most problematically, we have people only under the age of 10. What has gone wrong here!? The answer lies in a perhaps forgotten part of math from when we were children, the order of operations. If you recall from those lessons, the order of operations determines the way in which you process the problem. 4 + 7 x 2 is very different from (4+7) x 2. The former is 18 and the latter is 22 because the latter has parentheses which tell the reader to do that operation first. Because programming sits on top of mathematics (especially Boolean algebra), the syntax of mathematics is often embedded in programming.
Let’s use the order of operations correctly and rephrase our query. Note the parentheses now before Age > 40 and after Age < 10. Note also that the & is before the parentheses.
df.query("Pclass == 3 & Survived == 1 & (Age > 40 | Age < 10)")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
165 | 166 | 1 | 3 | Goldsmith, Master. Frank John William "Frankie" | male | 9.00 | 0 | 2 | 363291 | 20.5250 | NaN | S |
172 | 173 | 1 | 3 | Johnson, Miss. Eleanor Ileen | female | 1.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
184 | 185 | 1 | 3 | Kink-Heilmann, Miss. Luise Gretchen | female | 4.00 | 0 | 2 | 315153 | 22.0250 | NaN | S |
233 | 234 | 1 | 3 | Asplund, Miss. Lillian Gertrud | female | 5.00 | 4 | 2 | 347077 | 31.3875 | NaN | S |
261 | 262 | 1 | 3 | Asplund, Master. Edvin Rojj Felix | male | 3.00 | 4 | 2 | 347077 | 31.3875 | NaN | S |
338 | 339 | 1 | 3 | Dahl, Mr. Karl Edwart | male | 45.00 | 0 | 0 | 7598 | 8.0500 | NaN | S |
348 | 349 | 1 | 3 | Coutts, Master. William Loch "William" | male | 3.00 | 1 | 1 | C.A. 37671 | 15.9000 | NaN | S |
381 | 382 | 1 | 3 | Nakid, Miss. Maria ("Mary") | female | 1.00 | 0 | 2 | 2653 | 15.7417 | NaN | C |
414 | 415 | 1 | 3 | Sundman, Mr. Johan Julian | male | 44.00 | 0 | 0 | STON/O 2. 3101269 | 7.9250 | NaN | S |
448 | 449 | 1 | 3 | Baclini, Miss. Marie Catherine | female | 5.00 | 2 | 1 | 2666 | 19.2583 | NaN | C |
469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
479 | 480 | 1 | 3 | Hirvonen, Miss. Hildur E | female | 2.00 | 0 | 1 | 3101298 | 12.2875 | NaN | S |
483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.00 | 0 | 0 | 4134 | 9.5875 | NaN | S |
489 | 490 | 1 | 3 | Coutts, Master. Eden Leslie "Neville" | male | 9.00 | 1 | 1 | C.A. 37671 | 15.9000 | NaN | S |
644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
691 | 692 | 1 | 3 | Karun, Miss. Manca | female | 4.00 | 0 | 1 | 349256 | 13.4167 | NaN | C |
751 | 752 | 1 | 3 | Moor, Master. Meier | male | 6.00 | 0 | 1 | 392096 | 12.4750 | E121 | S |
777 | 778 | 1 | 3 | Emanuel, Miss. Virginia Ethel | female | 5.00 | 0 | 0 | 364516 | 12.4750 | NaN | S |
788 | 789 | 1 | 3 | Dean, Master. Bertram Vere | male | 1.00 | 1 | 2 | C.A. 2315 | 20.5750 | NaN | S |
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
This is why query is such a powerful function in Pandas. You can do a lot with a single string. There are other ways to achieve this same result, but if your question cna be entirely phrased as a series of comparison operators (equal to, less than, etc.), then Query is likely the best option.