All of the details of the calculation using sympy can be found here.
For our note let’s settle on a single simple example problem.
We have two samples of real numbers: a_1, ..., a_n and b_1, ..., b_n. All the a_i are mutually exchangeable (for example, generated by an independent and identically distributed random process), and likewise all the b_i. We want to know: are the a_i exchangeable with (essentially the same as) the b_i, or not?
For a concrete example: suppose the a_i are the profits observed on one investment strategy and the b_i are the profits observed on an alternate strategy. We would like to know if the strategies are different, and if so which one has been historically better.
Now the above is thoroughly solved by testing for a difference of means (such as with ANOVA), and also as an A/B testing question. However, let’s look at this directly in very basic terms.
Some basic concepts that can carry the desired analysis are the following:
- mean. For x = x_1, ..., x_n the mean is defined as mean(x) := (1/n) sum_{i=1 ... n} x_i. This can be thought of as one measure of what a “typical” x_i is. There are other measures of typicality, such as the median, but let’s stick to the mean.
- sd. The standard deviation is a measure of how much individuals vary. It is defined as sd(x) := sqrt((1/(n-1)) sum_{i=1 ... n} (x_i - mean(x))^2). The standard deviation is essentially the square root of the expected (or average) squared distance between examples and the mean. This is an estimate of how far individuals usually are from “typical.”
- se. The standard error is a measure of how much uncertainty we have in our estimate of the mean. Usually it is estimated as se(x) := sd(x) / sqrt(n). The idea is: if a = a_1, ..., a_n and b = b_1, ..., b_n were in fact exchangeable (generated by indistinguishable processes), then we would expect mean(b) - mean(a) to typically be around the size of se(a) or se(b) (se(a) and se(b) should themselves be similar under the assumptions that a and b are identical, the standard deviation is low, and n is large).
- p-value. The statistical significance p can be defined for any summary of data. If f() is our summary and t is a given threshold, then the one-sided significance p of f(), t is defined such that p = Pr[f(b) - f(a) >= t] under the assumption a and b are independently identically distributed. The idea is: observing a small p is unlikely under chance, so data that achieves a low p may in fact be due to a cause. How one estimates or calculates p is, of course, of interest. However, the properties of the value are determined by the definition.
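These definitions translate directly into code. Here is a minimal sketch (the function names are our own; for real work one would use numpy or the statistics module):

```python
import math

def mean(xs):
    # mean(x) := (1/n) sum_{i=1 ... n} x_i
    return sum(xs) / len(xs)

def sd(xs):
    # sd(x) := sqrt((1/(n-1)) sum_{i=1 ... n} (x_i - mean(x))^2)
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def se(xs):
    # se(x) := sd(x) / sqrt(n)
    return sd(xs) / math.sqrt(len(xs))

print(mean([1, 2, 3, 4]))  # 2.5
```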
The detailed question is: how do different summaries characterize the observed difference between a and b? We would like to know this in two cases:

- Case 0: a and b were generated the same way; there is no essential difference. This situation is often called a null hypothesis. It is usually something we hope isn’t the case, or a situation we are trying to disprove or eliminate.
- Case 1: a and b are generated in different ways. Usually we are hoping to confirm b is better than a, in the sense that mean(b) > mean(a). Though we can have the case that a and b are generated in different ways, but still have the same expected value or ideal mean.

What we want is a measure of to what degree the observed mean(b) is greater than the observed mean(a). It is typical to use Case 0 as a stand-in for “the means are not different” (though that isn’t completely precise, as different distributions may have the same mean).
Typical measures include:

- mean(b) - mean(a), the observed difference in means. Let’s call this “d”.
- (mean(b) - mean(a))/se, the observed difference in means scaled by the standard error. For this note it isn’t important exactly how se is estimated, though there are details and variations. Typically a pooled estimate such as se = se(a_1, ..., a_n, b_1, ..., b_n) is used. This summary is usually called a standard score (“z”) or “t”. This is a very common statistic or summary. Let’s call this “d/se”.
- (mean(b) - mean(a))/sd, the observed difference in means scaled by the standard deviation. Again, sd is estimated, typically with a pooled estimate such as sd = sd(a_1, ..., a_n, b_1, ..., b_n). This summary is usually called an effect size, the most famous of which is Cohen’s d. This is among the most useful statistics, and is often rediscovered, as it is under-taught outside the statistically oriented sciences. Let’s call this “d/sd”.
- p, or significance for a threshold difference in summary statistics. For a difference in any of the above summaries (or many more) we can calculate the significance or p-value, and use this as our measure of interest. We advise against this, but it is worth discussing. Let’s call any of these (they all behave the same due to their shared definition) “p”.
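To make the four summaries concrete, here is a sketch that computes d, d/se, and d/sd with the pooled estimates described above, and estimates p by re-randomizing the pooled data at threshold t = observed d (one of many ways to estimate a significance; the function name and the permutation count are our own choices):

```python
import random
import statistics

def summaries(a, b, n_perm=2000, rng=random.Random(2021)):
    # d: raw difference of means
    d = statistics.mean(b) - statistics.mean(a)
    # pooled estimates from the combined sample
    pooled = list(a) + list(b)
    sd_p = statistics.stdev(pooled)
    se_p = sd_p / len(pooled) ** 0.5
    # p: one-sided significance of the observed d, estimated by
    # re-randomizing the pooled data (the exchangeable null)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d_star = statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
        if d_star >= d:
            count += 1
    return {'d': d, 'd/se': d / se_p, 'd/sd': d / sd_p, 'p': count / n_perm}
```

For two clearly separated samples, d/se and d/sd are both large and p is near zero; for exchangeable samples, p comes out roughly uniform.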
Having set up our notation, we can characterize our four summary families (d, d/se, d/sd, p). Under moderate assumptions (bounded and non-zero standard deviation), as our sample size n gets large we expect the following for each of these summaries.
Summary | No actual difference (null hypothesis) | mean(b) > mean(a) (what we are hoping for) |
---|---|---|
sd(b) – sd(a) (difference of standard deviations) | distributed with mean 0 | distributed with constant mean, constant standard deviation |
se(b) – se(a) (difference of standard errors) | distributed with mean 0 | distributed with mean approaching 0, standard deviation approaching 0 |
d (raw difference in means) | distributed with mean 0 | distributed with constant mean, standard deviation approaching 0 |
d / sd (effect size) | distributed with mean 0 | distributed with mean converging to a constant (the effect size), standard deviation approaching 0 |
d / se (standard score) | distributed with mean 0 | distributed with mean going to infinity |
p (significance) | distributed uniformly in the interval [0, 1] | distribution converges to zero |
And this brings us to our point. In the large-sample limit, the d/se and p summaries carry only a single bit of information at the surface level (that is, extractable without knowledge of the sample size n): “there is a difference.” In contrast, the effect size d/sd converges to a quantity that has conventional detailed interpretations.
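A quick simulation illustrates the contrast (the sample sizes, the shift of 1.0, and the pooled-se convention from earlier are all our own illustrative choices): with a fixed true difference, d/se keeps growing as n grows, while d/sd settles near a constant.

```python
import random

def scaled_differences(n, shift, rng):
    # two samples of size n; b is shifted up by a constant
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(shift, 1) for _ in range(n)]
    d = sum(b) / n - sum(a) / n
    # pooled standard deviation and standard error
    pooled = a + b
    m = sum(pooled) / (2 * n)
    sd_p = (sum((x - m) ** 2 for x in pooled) / (2 * n - 1)) ** 0.5
    se_p = sd_p / (2 * n) ** 0.5
    return d / se_p, d / sd_p

rng = random.Random(2021)
small_n = scaled_differences(100, 1.0, rng)    # (d/se, d/sd) at n = 100
large_n = scaled_differences(10000, 1.0, rng)  # (d/se, d/sd) at n = 10000
# d/se grows roughly as sqrt(n); d/sd stays near the same constant
```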
A standard table of effect sizes (source: Wikipedia) is copied below.
Effect size | d | Reference |
---|---|---|
Very small | 0.01 | [9] |
Small | 0.20 | [8] |
Medium | 0.50 | [8] |
Large | 0.80 | [8] |
Very large | 1.20 | [9] |
Huge | 2.0 | [9] |
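Read as a lookup, the table maps an observed |d| to its nearest conventional descriptor. A small sketch (the helper name is our own):

```python
# reference values for Cohen's d, per the table above
COHEN_LABELS = [
    (0.01, 'very small'),
    (0.20, 'small'),
    (0.50, 'medium'),
    (0.80, 'large'),
    (1.20, 'very large'),
    (2.00, 'huge'),
]

def describe_effect_size(d):
    # return the descriptor whose reference value is closest to |d|
    return min(COHEN_LABELS, key=lambda ref: abs(abs(d) - ref[0]))[1]

print(describe_effect_size(0.55))  # medium
```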
Also note for p there is no strong guarantee that p is large in the null hypothesis case, as p does not have a concentrated distribution in this situation.
In my experience, “standard scores” (z, t) and significances (p) are more commonly taught to statistical outsiders than effect sizes. The social sciences, however, are particularly strong in using effect sizes well, as Jacob Cohen was both a psychologist and a statistician (ref). Effect sizes, such as Cohen’s d, are very useful; so they keep getting profitably re-introduced by practitioners in many other fields.
For more on measures of correspondence, please see here.
The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include being able to specify a single data transformation that can then be translated and executed in many realizations, currently including Pandas, Google BigQuery, PostgreSQL, Spark, and SQLite. It allows you to rehearse and debug your big data work in memory.
Some notable features of the 0.9.0 PyPi release include:

- WITH operator support, for better machine-generated SQL.
We’ve been using the data algebra to speed up development on both client and internal Python data science projects. I invite you to give it a try.
The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows you to develop data processing pipelines incrementally and then use and re-use them on different data sets in different data stores.
Please check it out.
Now I kind of wonder what Pandas is, or what it wants to be.
The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnings (in some situations over and over again). It is now considered rude to insert a column into a Pandas data frame. I find this odd, as I pretty much come to Pandas for a structure I can easily add and remove columns from.
Let’s work an example.
import pandas
import numpy

pandas.__version__

'1.3.0'
# define some experiment parameters
nrow = 100 # number of rows to generate
ncol = 100 # number of columns to generate
nreps = 100 # number of repetitions in timing
First we try an example that simulates what might happen as a data scientist works with a data frame: some columns get added. Here they are added all at once in one place, as this is just simulation code.
# define our first function: adding columns as user might
def f_insert():
d = pandas.DataFrame({
'y': numpy.zeros(nrow)
})
for i in range(ncol):
d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
return d
<ipython-input-4-aedae30a984f>:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
d['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
2.707611405
The above warning only occurred once in this context. In other applications I have seen it repeat very many times, overwhelming the worksheet. I guess I could add %%capture to each and every cell to try to work around this.
The demanded alternative is something like the following.
# switch to Pandas's insisted upon alternative
def f_concat():
d = pandas.DataFrame({
'y': numpy.zeros(nrow)
})
return pandas.concat(
[d] +
[pandas.DataFrame({'var_' + str(i).zfill(4): numpy.zeros(nrow)}) for i in range(ncol)],
axis=1
)
1.4440117099999998
Yes, concat is faster, but it is only natural in artificial cases such as the above, where I am adding all the columns in a single place. So is any sequence of inserts in Pandas now a ticking time bomb that will spill warnings once some threshold is crossed?
I guess one could keep a dictionary map of column names to numpy 1-d arrays and work with that if one wants a column oriented data structure.
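That dictionary-of-columns idea can be sketched as follows: accumulate columns cheaply in a plain dict, and only build the DataFrame once at the end (f_dict is our own name, mirroring the nrow/ncol settings above).

```python
import numpy
import pandas

nrow = 100  # number of rows to generate
ncol = 100  # number of columns to generate

def f_dict():
    # inserts into a plain dict are cheap and warning-free
    cols = {'y': numpy.zeros(nrow)}
    for i in range(ncol):
        cols['var_' + str(i).zfill(4)] = numpy.zeros(nrow)
    # build the data frame once, from all the columns
    return pandas.DataFrame(cols)
```

This keeps the convenient add-a-column workflow while paying the frame construction cost only once.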
(note: how to suppress this warning can be found here: https://github.com/twopirllc/pandas-ta/issues/340#issuecomment-879450854 )
For a few years we’ve been running a data science intensive for a really neat FAAMG company. The idea is to give engineers some hands-on live workbook time using methods varying from linear regression, to xgboost, to deep neural networks. Learning how participants progress and internalize the concepts is amazing.
Obviously the pandemic put us into a remote mode.
This is my current pseudo-studio.
My current minimal setup includes:
Consider what "5" == 5 should be, and perhaps even hazard a guess as to what the answer is in your most familiar programming language.
In my opinion SQL gets it right. For example, we get the following in Google BigQuery.
SELECT "5" = 5 -- No matching signature for operator = for argument types: STRING, INT64. Supported signature: ANY = ANY at [1:8]
That is a nice safe early error that can prevent a lot of confusing data processing bugs down the line. This may be clearer in the related example.
SELECT "5" IN (1, 2, 3) -- No matching signature for operator IN for argument types literal STRING and {INT64} at [1:12]
This follows from the expectation in SQL that columns and lists have homogeneous types.
In this context it is likely the user meant SELECT 5 IN (1, 2, 3), i.e. to check if an integer was in a given set of integers. And it is unlikely the user actually meant SELECT "5" IN (1, 2, 3). The second form can never be true in SQL, by simple type inspection, so it is useful to have it disallowed. Violations of expectations are caught, and thus easy to find and avoid.
Now let’s try this in a few more languages.
For fun, from memory: what is the outcome of evaluating the expression "5" == 5 in Python 3? It turns out it is False, which is useful and understandable in a general purpose programming language. Python deals with heterogeneous lists and sets. Thus, "5" == 5 and "5" in {1, 2, 3} are sensible expressions, given the language context.
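A quick check of the above claims in Python 3:

```python
# no implicit cast between str and int, so these comparisons are legal but False
print("5" == 5)           # False
print("5" in {1, 2, 3})   # False
# comparisons succeed when the types actually match
print(5 in {1, 2, 3, 5})  # True
```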
And now we get to the odd one.
R is a language I love and routinely work in. Heck, I even wrote a book about working in it, which I am very proud of.
However, I can’t defend R’s return value for "5" == 5. This turns out to be TRUE. One can, of course, guess at what sort of implicit casting is supporting this, but R isn’t a language where strings and numbers are generally equivalent. Yet we have 5 %in% c("5") evaluating to TRUE. R mostly enforces homogeneous types, but it does so by quiet implicit type conversion (a buggy gift that keeps on giving). A well-informed R user expects c(5, "5") to be a vector of strings; I am less convinced many expect "5" == 5 to evaluate to TRUE.
One would expect equal values to always support the same operations. However, in R, 5 + 1 is a sensible expression and "5" + 1 is not. So I find it hard to argue 5 and "5" are equivalent values.
And that is our programming foible for the day.
I’d like to work an example of using SQL WITH common table expressions to produce more legible SQL. A major complaint with SQL is that it composes statements by rightward nesting. That is: a sequence of operations A -> B -> C is represented as SELECT C FROM SELECT B FROM SELECT A. However, the SQL 99 standard introduced the WITH statement and common table expressions (ref). WITH statements allow forward composition.
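As a hand-written illustration of that forward style (the table and step names here are made up, run through Python’s built-in sqlite3): each WITH clause names one step, and the steps read top to bottom instead of inside out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER)")
conn.executemany("INSERT INTO d VALUES (?)", [(1,), (2,), (3,)])

# steps read in execution order: step_a, then step_b, then the final SELECT
sql = """
WITH
  step_a AS (SELECT x, x + 1 AS z FROM d),
  step_b AS (SELECT x, z, z + 2 AS q FROM step_a)
SELECT x, z, q, q + 3 AS h FROM step_b ORDER BY x
"""
rows = conn.execute(sql).fetchall()
print(rows)  # [(1, 2, 4, 7), (2, 3, 5, 8), (3, 4, 6, 9)]
conn.close()
```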
Let’s take a look at asking the data_algebra (ref) to emit SQL with and without common table expressions.
First we set up some example data.
import sqlite3
from data_algebra.data_ops import *
import data_algebra.test_util
import data_algebra.SQLite
d = data_algebra.default_data_model.pd.DataFrame({
'x': [1, 2, 3]
})
d
x | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
Next we set up our calculations. Please note the order in which they are performed.
ops = describe_table(d, table_name='d') .\
extend({'z': 'x + 1'}) .\
extend({'q': 'z + 2'}) .\
extend({'h': 'q + 3'})
ops
TableDescription(
 table_name='d',
 column_names=[
  'x']) .\
 extend({
  'z': 'x + 1'}) .\
 extend({
  'q': 'z + 2'}) .\
 extend({
  'h': 'q + 3'})
res_pandas = ops.transform(d)
res_pandas
x | z | q | h | |
---|---|---|---|---|
0 | 1 | 2 | 4 | 7 |
1 | 2 | 3 | 5 | 8 |
2 | 3 | 4 | 6 | 9 |
expect = data_algebra.default_data_model.pd.DataFrame({
'x': [1, 2, 3],
'z': [2, 3, 4],
'q': [4, 5, 6],
'h': [7, 8, 9]
})
assert data_algebra.test_util.equivalent_frames(res_pandas, expect)
db_model = data_algebra.SQLite.SQLiteModel()
with sqlite3.connect(":memory:") as conn:
db_model.prepare_connection(conn)
db_handle = db_model.db_handle(conn)
db_handle.insert_table(d, table_name='d')
sql_regular = db_handle.to_sql(ops, pretty=True, use_with=False, annotate=True)
res_regular = db_handle.read_query(sql_regular)
sql_with = db_handle.to_sql(ops, pretty=True, use_with=True, annotate=True)
res_with = db_handle.read_query(sql_with)
assert data_algebra.test_util.equivalent_frames(res_regular, expect)
assert data_algebra.test_util.equivalent_frames(res_with, expect)
The standard nested SQL for these operations looks like the following.
print(sql_regular)
SELECT -- extend({ 'h': 'q + 3'})
 "x",
 "z",
 "q",
 "q" + 3 AS "h"
FROM
 (SELECT -- extend({ 'q': 'z + 2'})
   "x",
   "z",
   "z" + 2 AS "q"
  FROM
   (SELECT -- extend({ 'z': 'x + 1'})
     "x",
     "x" + 1 AS "z"
    FROM
     "d") "extend_0") "extend_1"
Notice the variables are generated in reverse order (h, q, z instead of z, q, h).
The common table expression version looks like the following; it involves less nesting, and values move forward through the notation.
print(sql_with)
WITH
 "extend_0" AS
  (SELECT -- extend({ 'z': 'x + 1'})
    "x",
    "x" + 1 AS "z"
   FROM
    "d"),
 "extend_1" AS
  (SELECT -- extend({ 'q': 'z + 2'})
    "x",
    "z",
    "z" + 2 AS "q"
   FROM
    "extend_0")
SELECT -- extend({ 'h': 'q + 3'})
 "x",
 "z",
 "q",
 "q" + 3 AS "h"
FROM
 "extend_1"
It is interesting to note when WITH, or common table expressions, became widely available. Wikipedia lists the versions (and hence dates) (ref) at which common table expressions became supported in popular databases.
Some of the cost of implementing common table expressions is that they are where databases allow recursive or fixed-point semantic extensions. From the database point of view these are major semantic changes, not mere notational conveniences.
The data algebra is a modern realization of elements of Codd’s 1969 relational model for data wrangling (see also Codd’s 12 rules).
The idea is: most data manipulation tasks can usefully be broken down into a small number of fundamental data transforms plus composition. In Codd’s initial writeup, composition was expressed using standard mathematical operator notation. For “modern” realizations one wants to use a composition notation that is natural for the language you are working in. For Python the natural composition notation is method dispatch.
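The method-dispatch idea can be illustrated with a toy (this is not the data algebra’s API, just a sketch of the composition style): each method returns a new object, so transform steps chain left to right.

```python
class ToyPipeline:
    # immutable: each method returns a new pipeline with one more step
    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def extend(self, expr):
        # record an "add derived column" step
        return ToyPipeline(self.steps + (('extend', expr),))

    def project(self, expr):
        # record a "summarize" step
        return ToyPipeline(self.steps + (('project', expr),))

toy = ToyPipeline().extend('z = x + 1').project('sum(z)')
print(toy.steps)  # (('extend', 'z = x + 1'), ('project', 'sum(z)'))
```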
The problems with the relational model were twofold:
The data algebra implements the Codd transforms (using Codd’s names where practical) in Python. It can manipulate data in Pandas or SQL. Such a strategy is famously used in the dplyr / dbplyr R packages (which use a pipe operator for composition, as R’s native S3/S4 method dispatch again composes through somewhat illegible nesting).
The benefits / purposes of the data algebra include:
Here is a simple data algebra example (source here).
First let’s import our packages and set up an example data frame.
import pandas
from data_algebra.data_ops import *
d = pandas.DataFrame({
'c': ['c', 'c', 'b', 'a'],
'v': [1, 2, 3, 4],
})
d
c | v | |
---|---|---|
0 | c | 1 |
1 | c | 2 |
2 | b | 3 |
3 | a | 4 |
Now let’s define our data transform using the data algebra. New columns are defined by specifying a Python dictionary where new column names are the keys and the source code for the operations are the values. We try to use Codd’s names for operators: adding columns is extend(), and summarizing data is project().
table_name = 'data-algebra-test.test_1.d'
operations = describe_table(d, table_name=table_name) .\
extend({
'g': '"prefix_" %+% c' # concatenate strings
}) .\
project({ # build per- group g totals of v
'group_total': 'v.sum()'
},
group_by=['g']
) .\
order_rows(['g']) # choose a presentation order of rows
We can then apply these operations to any data frame that has the columns specified in the table description (and appropriate column types).
res_pandas = operations.transform(d)
res_pandas
g | group_total | |
---|---|---|
0 | prefix_a | 4 |
1 | prefix_b | 3 |
2 | prefix_c | 3 |
Applying the same operations in a database is quite simple. First we connect to our database. Here we are inserting the data as an example; in serious applications the source table would usually already be present.
import os
from google.cloud import bigquery
import data_algebra.BigQuery
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/johnmount/big_query/big_query_jm.json"
biqquery_handle = data_algebra.BigQuery.BigQueryModel().db_handle(bigquery.Client())
biqquery_handle.insert_table(d, table_name=table_name, allow_overwrite=True)
Then we can generate SQL tailored to the specific database.
bigquery_sql = biqquery_handle.to_sql(operations, pretty=True)
print(bigquery_sql)
SELECT
 `g`,
 `group_total`
FROM
 (SELECT
   `g`,
   SUM(`v`) AS `group_total`
  FROM
   (SELECT
     ("prefix_" || `c`) AS `g`,
     `v`
    FROM
     `data-algebra-test.test_1.d`) `extend_1`
  GROUP BY
   `g`) `project_2`
ORDER BY
 `g`
Operations can be used to land results in the database (using CREATE TABLE AS, avoiding round-tripping data in and out of the database). Operations can also be used to return results as a Pandas data frame.
res_bigquery = biqquery_handle.read_query(operations)
res_bigquery
g | group_total | |
---|---|---|
0 | prefix_a | 4 |
1 | prefix_b | 3 |
2 | prefix_c | 3 |
And we can check we get equivalent results in Pandas and from the database.
assert res_pandas.equals(res_bigquery)
biqquery_handle.close()
We can repeat the database example using another database, simply by building a different database handle.
import sqlite3
import data_algebra.SQLite
sqlite_handle = data_algebra.SQLite.SQLiteModel().db_handle(sqlite3.connect(":memory:"))
sqlite_sql = sqlite_handle.to_sql(operations, pretty=True)
print(sqlite_sql)
SELECT
 "g",
 "group_total"
FROM
 (SELECT
   "g",
   SUM("v") AS "group_total"
  FROM
   (SELECT
     ('prefix_' || "c") AS "g",
     "v"
    FROM
     "data-algebra-test.test_1.d") "extend_1"
  GROUP BY
   "g") "project_2"
ORDER BY
 "g"
sqlite_handle.insert_table(d, table_name=table_name, allow_overwrite=True)
res_sqlite = sqlite_handle.read_query(operations)
res_sqlite
g | group_total | |
---|---|---|
0 | prefix_a | 4 |
1 | prefix_b | 3 |
2 | prefix_c | 3 |
assert res_sqlite.equals(res_bigquery)
Also, operations have a pretty good printing method.
operations
TableDescription(
 table_name='data-algebra-test.test_1.d',
 column_names=[
  'c', 'v']) .\
 extend({
  'g': "'prefix_'.concat(c)"}) .\
 project({
  'group_total': 'v.sum()'},
  group_by=['g']) .\
 order_rows(['g'])
And that is a small demonstration of the data algebra.
Version 0.7.0 is a major upgrade. The improvements include:

- Moving off of Pandas .eval() and .query(). Previous versions of the data algebra tried to dispatch expression evaluation to Pandas through the .eval() and .query() interfaces. These interfaces have proven to be fairly limited, and not how most users use Pandas. The data algebra now directly manages expression evaluation over Pandas columns.

The data algebra is a great tool for Python data science projects. We are thrilled it has gotten to the point where we use it in client projects. What is missing is a “data algebra manual” and training, but with luck we hope to someday fill that gap.