Chess psychometrics – The gender equality paradox

Chess databases contain millions of games, whose players can largely be identified via the FIDE player database [1], which contains age, sex, nationality and ratings. These games are interactions providing information about behavior in a competitive context. They are a goldmine for psychological or sociological research into a wide range of topics. The datasets derivable from chess databases are much larger than what can be realistically achieved in typical psychological research. As a researcher you are really only limited by your imagination and the number of your grad students.

While the typical university professor is severely limited in the former, I am unfortunately limited in the latter. So we will see how many of my chess psychometrics projects I’ll be able to bring to completion. For now we’ll start with something simple and not very original: We will check whether the gender equality paradox holds in chess.

The gender equality paradox is the observation that women in more gender-equal societies tend to choose more stereotypically female occupations and are, for example, less likely to go into STEM. The gender imbalance in chess is very comparable to the imbalance in STEM research [2]. In fact, it is usually even more extreme, with women in many countries representing less than 5% of the pool of rated players.

Outside of developed countries, the number of rated players is often quite small and not very representative when it comes to age, rating or possibly sex. So it is not surprising that on a global level we find no correlation between the global gender gap index and the fraction of female players.

In European countries however, there is a significant negative correlation between gender equality and the fraction of female players. (Yes, Turkey is for some reason in my list of European countries.)

Pearson correlation: -0.436, p-value: 0.011
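For readers who want to run this kind of check themselves, here is a minimal sketch in plain Python. The index values and female fractions are invented placeholders, not the data behind the result above. The permutation test is a swapped-in way of getting a p-value without extra libraries and is not necessarily how the number above was produced.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=10_000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# Placeholder values for illustration only (NOT the real country data).
gender_gap_index = [0.88, 0.85, 0.82, 0.80, 0.78, 0.75, 0.72, 0.70, 0.68, 0.64]
female_fraction = [0.04, 0.05, 0.04, 0.06, 0.07, 0.06, 0.09, 0.08, 0.10, 0.09]

r = pearson_r(gender_gap_index, female_fraction)
p = permutation_p_value(gender_gap_index, female_fraction)
print(f"Pearson r = {r:.3f}, permutation p-value = {p:.4f}")
```

With the real per-country values substituted for the placeholders, the same two functions would give a correlation and a p-value comparable in spirit to the ones quoted above.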

This looks like a straightforward result. However, I am generally skeptical about the significance of this kind of correlation, because I suspect that often the significance is a result of countries falling into a small number of similarly behaving clusters. If these clusters are then arranged linearly by chance, we get a significant correlation by virtue of decomposing these clusters into many countries.

So it might be the case that Northern countries all have high gender equality and low female chess player fractions by chance, while Eastern Europeans have low gender equality and high female chess participation for historical reasons. Because these clusters and the rest of the countries in between constitute a lot of observations, the result looks a lot more robust than it really is.
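The cluster worry can be made concrete with a toy example. The sketch below uses eight invented points (not real countries) arranged in two clusters. Within each cluster the correlation is zero by construction, yet the pooled correlation is strongly negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_of(points):
    """Correlation of a list of (x, y) points."""
    xs, ys = zip(*points)
    return pearson_r(xs, ys)

# Invented toy points: (gender equality index, female player fraction).
# Each cluster is a small diamond, so x and y are uncorrelated inside it.
north = [(0.80, 0.02), (0.82, 0.04), (0.84, 0.02), (0.82, 0.00)]
east = [(0.68, 0.10), (0.70, 0.12), (0.72, 0.10), (0.70, 0.08)]

r_north = r_of(north)
r_east = r_of(east)
r_pooled = r_of(north + east)

print(f"within north: {r_north:.3f}")
print(f"within east:  {r_east:.3f}")
print(f"pooled:       {r_pooled:.3f}")
```

The pooled correlation here comes entirely from the position of the two clusters relative to each other, which is exactly the situation in which counting every country as an independent observation overstates the evidence.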

Sure enough, there is no such correlation in Eastern Europe, and restricted to Western Europe the correlation loses all significance.

On the other hand, the loss of statistical significance is due to just two outliers: Iceland and France. So is the gender equality paradox a thing in chess or not?

To detect even a rather weak tendency, we average over all countries that fall into the same section of the global gender gap index (GGG-index). This time we look at all countries in our dataset.
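The averaging step could look roughly like the following sketch. The bin width of 0.05 and the example pairs are assumptions made for illustration; the section boundaries actually used for the plot are not stated here.

```python
import math
from collections import defaultdict

def binned_means(pairs, bin_width=0.05):
    """Average the second value of each pair within equal-width bins of the first."""
    bins = defaultdict(list)
    for index_value, fraction in pairs:
        # round() guards against float artifacts where a quotient lands
        # just below a whole number.
        bin_number = math.floor(round(index_value / bin_width, 9))
        bins[bin_number].append(fraction)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

# Invented (index, female fraction) pairs, for illustration only.
toy_pairs = [(0.61, 0.10), (0.63, 0.06), (0.72, 0.04), (0.74, 0.06), (0.81, 0.03)]

means = binned_means(toy_pairs)
for bin_number, mean_fraction in means.items():
    lower = bin_number * 0.05
    print(f"index {lower:.2f} to {lower + 0.05:.2f}: mean female fraction {mean_fraction:.3f}")
```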

If we ignore the four countries with the lowest gender equality which average very low, we actually see a nice downward trend in female chess playing the higher the gender equality. I tentatively conclude that the gender equality paradox does actually exist in chess.

[1] Fide player database

[2] The Gender-Equality Paradox in STEM Education


Demographic Change in France: Discussion

The typical rightwing theory would be that the immigrant population is outbreeding the natives due to much higher birthrates. Is this the story behind the sickle cell data? If we translate the percentages into absolute numbers based on the number of births in each relevant year, we get absolute births:
2000: 778900
2007: 785985
2010: 802224
2012: 790290
2013: 781621
2015: 760421
Sickle cell tested newborns:
2000: 147991
2007: 223613
2010: 252701
2012: 272176
2013: 279039
2015: 295804
Not sickle cell tested newborns:
2000: 630909
2007: 562372
2010: 549523
2012: 518114
2013: 502582
2015: 464617

This is a growth of 4.72% per year and a shrinking of 2.02% per year respectively. Let’s imagine there was a French population and an immigrant population established maybe in the 1960s. And both populations were breeding merrily away with a perfectly steady fertility rate. In this closed system, what kind of fertility rates would account for the growth rates we see?
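As a check on the arithmetic, here is a small sketch that computes compound annual growth rates from the 2000 and 2015 figures listed above. Using only the two endpoints is an assumption on my part; a fit through all six years would give slightly different values.

```python
# Endpoint values taken from the lists above.
tested = {2000: 147991, 2015: 295804}
not_tested = {2000: 630909, 2015: 464617}

def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1

g_tested = annual_growth_rate(tested[2000], tested[2015], 2015 - 2000)
g_not_tested = annual_growth_rate(not_tested[2000], not_tested[2015], 2015 - 2000)

print(f"tested:     {g_tested:+.2%} per year")
print(f"not tested: {g_not_tested:+.2%} per year")
```

Both values come out close to the rates quoted above.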

Well, with a generation length of 30 years (and human generation lengths almost always fall close to 30 years even with very different fertility rates), a fertility rate of 7.97 kids per woman for the immigrant population and 1.08 kids per woman for the French population would lead exactly to the growth rates we calculated.
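One simple model that comes close to these numbers is sketched below. It assumes that a fertility rate of exactly 2.0 children per woman keeps births constant, and that births grow by a fixed factor per 30-year generation. Both the function name and the replacement level of 2.0 are my assumptions for illustration, so small rounding differences to the figures above are to be expected.

```python
def implied_fertility(annual_growth, generation_years=30.0, replacement_fertility=2.0):
    """Fertility rate that would produce a given steady annual growth in births.

    Assumes a closed population and that `replacement_fertility` children
    per woman keep the number of births constant across generations.
    """
    per_generation_factor = (1 + annual_growth) ** generation_years
    return replacement_fertility * per_generation_factor

tfr_growing = implied_fertility(0.0472)     # growth rate of tested newborns
tfr_shrinking = implied_fertility(-0.0202)  # shrink rate of untested newborns

print(f"growing group:   {tfr_growing:.2f} children per woman")
print(f"shrinking group: {tfr_shrinking:.2f} children per woman")
```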

That’s of course insane. According to the studies I have seen on this topic, immigrant birth rates have never been close to 8 kids per woman and these days they are certainly far lower. The French fertility rate of 1.08 kids per woman is also crazy low, because we made the assumption of a steady rate over many generations. If the rate was higher in the past, today’s birth rate would have to be even lower to account for the decline. Or conversely, if the birthrate today was actually higher, it must have been below 1.0 in the past.

So how do we square the circle? The sickle cell birth rate increase has to be predominantly driven by recent immigration. That fits the numbers. Birth rates among immigrants usually drop relatively quickly towards the birth rate typical of the country. Legal immigration to France has been massive in recent years. And there is also illegal immigration, estimated by Wikipedia to lie between 80,000 and 100,000 per year [1]. The story is probably one of young immigrants coming to France and then having kids, which also implies that a total immigration stop might lead to a reduction of the percentage of sickle cell babies.

But how can the native French birth rate be this low? The answer is probably that there is a certain amount of intermarriage, which gets counted for the sickle cell numbers, and the overall fertility rate of ethnic French women is pretty close to that in neighbouring European countries, maybe in the vicinity of 1.4.

By the way, the current rates of growth and decline predict parity in sickle cell births and non-sickle cell births in 2022 and a 66% majority for the former in 2032. Of course, for reliable predictions a more sophisticated model is needed than just extrapolating growth rates.

I am not somebody to grieve for the French genepool, but I think this rapid change is dangerous for a variety of reasons.

It seems probable that within a decade or two, most French people will wake up to a reality where France is still 70% white, but the future is very noticeably 70% black. How will they react?

If the relative growth rates hold and at some point the political power changes hands, how will that affect the ethnic French? Even in a very peaceful best case scenario the new government will have been brought into power by an electorate much younger than the opposition, and with wages and pensions much lower than those of the opposition. In this situation a massive cut in pensions is the logical result in a democracy.

Or maybe the percentage stabilizes somewhere and French and Africans just have to live side by side. Well, the Basques, the Northern Irish, the Ukrainians and all inhabitants of Balkan states will tell you that even in Europe, having different ethnic groups in one country is not a recipe for peace. What does the trouble in the banlieues look like if scaled up five-fold?

I am not sure how big the achievement gap between second-generation immigrants and the ethnic French is. But if there is a significant gap, the massive influx of lower qualified workers into the labour market will retard economic growth. That’s not gonna work wonders for ethnic relations. All in all a very worrying development, with France not the only European country in which rapid demographic change might lead to major upheaval in the next decades.

[1] Illegal Immigration to France

Demographic Change in France: The Numbers

A few years back certain medical data made a big wave in right-wing circles, never quite spilling over into mainstream media. The data in question consisted of percentages of newborns tested for sickle cell anemia in mainland France. In France, only newborns that have at least one parent originating from a region in which sickle cell anemia is common are tested for the disease. As sickle cell anemia is mostly prevalent in Africa, these percentages were taken as a stand-in for the percentage of French newborns of African heritage.

The screening data suggested that in 2000, 19 percent of babies born in mainland France (excluding overseas departments) were of African origin, a number that rose steadily to 38.9 percent in 2015. This is certainly surprising. If these numbers are correct, France’s ethnic makeup seems poised to jump from entirely European to basically Brazil within two generations.

To me, these data are worth investigating for several reasons.

During the last decades the media fed us a steady diet of articles about the French family friendly policies that were the reason for the birth rate collapse failing to materialize in France. It would certainly be interesting if that was just nonsense and the real reason was a more fecund (or just bigger) class of immigrants.

Ethnic replacement is a centerpiece of rightwing agitation. Of course, the media tells us that it is just a conspiracy theory. Just as with the French birth rate, I am very much interested in the extent of lies told to me by mainstream media outlets. Call it a desire for informational emancipation.

Ethnicity correlates with lots of variables of interest. Quantifying such a rapid change would allow predictions of crime rates, economic growth, human capital, unemployment, etc. Rapid change of any sort is often accompanied by many dangers. If you don’t know about the change, you can’t look out for the dangers.

There are several arguments against equating sickle cell screening with African origin. Among the countries that provided significant numbers of immigrants to France, sickle cell anemia is prevalent in Italy, Greece and Turkey aside from the Maghreb, Subsaharan Africa and the Caribbean. However, the number of recent European immigrants from sickle cell regions is too small to account for more than a few percentage points.

It has also been argued that some hospitals do not distinguish by origin, but instead test all newborns. That is entirely possible. However, it leads to a dilemma. If the absolute number of newborns at risk for sickle cell anemia is overestimated, the growth rate has to be underestimated!
Or to put it differently: If the 19% in 2000 were actually just 10% because 9% were due to unnecessary testing, then to get to 39% in 2015 the percentage of actual kids at risk had to triple from 10% to 30% instead of double from 19% to 39%.
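The same point as a two-line calculation. This is a sketch under the assumption that the share of unnecessary tests stays constant over time; the helper name is made up for illustration.

```python
def corrected_growth_factor(share_start, share_end, constant_overtesting):
    """Growth factor of the true at-risk share after removing a constant
    share of unnecessary tests from both the start and the end value."""
    return (share_end - constant_overtesting) / (share_start - constant_overtesting)

naive = corrected_growth_factor(0.19, 0.39, 0.00)      # no overtesting assumed
corrected = corrected_growth_factor(0.19, 0.39, 0.09)  # 9 points of overtesting

print(f"naive growth factor:     {naive:.2f}")
print(f"corrected growth factor: {corrected:.2f}")
```

The larger the assumed constant overtesting, the larger the corrected growth factor becomes.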

Alternatively, the number of hospitals just testing everybody has steadily risen. In which case the entire data is worthless. Or the original study could just be a hoax by a devious far-right physician. Who knows?

So the first point on our agenda is trying to independently verify the plausibility of the data.

To this end, I downloaded the data for given names in France provided by the French bureau for statistics, INSEE [1]. I also created a list of 2211 popular Muslim names, specifically Arab and Turkish names. Not all of the sickle cell tested babies will be of Arab or Turkish origin. And not all kids of Arab and Turkish origin will be given Arab and Turkish names. And additionally, my list probably doesn’t cover more than a small chunk of all actual Arab and Turkish names. But it still allows us to track the increase of a certain subset of all kids that would be subject to sickle cell testing.

A first quick and dirty run of the numbers: In 2000, out of 800039 kids my list covers 46718 or 5.8%. In 2015, my list covers 80387 out of 777746 names, or 10.3%. This amounts to an estimated 1.77-fold increase of Arab/Turkish newborns over the time span in which the sickle cell percentage roughly doubled, which is reasonably close.
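The arithmetic behind these three numbers, written out as a short sketch using only the counts quoted above:

```python
# Counts quoted in the paragraph above.
covered_2000, total_2000 = 46718, 800039
covered_2015, total_2015 = 80387, 777746

share_2000 = covered_2000 / total_2000
share_2015 = covered_2015 / total_2015
fold_increase = share_2015 / share_2000

print(f"2000: {share_2000:.1%}")
print(f"2015: {share_2015:.1%}")
print(f"fold increase: {fold_increase:.2f}")
```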

However, out of my 2211 names only 103 and 127 actually occur in the INSEE list of given names for the years 2000 and 2015. Only 77 names occur in both lists. Some of these names are clearly not just popular among Muslims; girls’ names especially are often ambiguous. So let’s try to tighten up the method.

Now, we only look at names present in both years. We exclude all ambiguous names. Each remaining name provides a separate estimate how much the percentage of Muslim newborns has changed between 2000 and 2015. This time the overall percentage accounted for by these 56 names almost exactly doubles from 1.95% to 3.89%. The median increase, which should be more robust against outliers (like short term trends in popularity), is also exactly 2.0.
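A minimal sketch of this per-name estimate follows. The names and counts are invented placeholders and the totals are rounded for readability; with the real INSEE counts plugged in, the same function would produce one ratio per name.

```python
from statistics import median

def share_ratios(counts_start, counts_end, total_start, total_end):
    """Per-name ratio of the share of births in the end year to the start year.

    Only names present in both years are considered.
    """
    ratios = {}
    for name in counts_start.keys() & counts_end.keys():
        share_start = counts_start[name] / total_start
        share_end = counts_end[name] / total_end
        ratios[name] = share_end / share_start
    return ratios

# Invented placeholder counts (NOT the INSEE data).
counts_2000 = {"name_a": 400, "name_b": 1000, "name_c": 250, "name_d": 90}
counts_2015 = {"name_a": 800, "name_b": 1900, "name_c": 1000, "name_e": 50}

ratios = share_ratios(counts_2000, counts_2015, total_start=800000, total_end=800000)
median_ratio = median(ratios.values())

print(sorted(ratios.items()))
print(f"median ratio: {median_ratio:.2f}")
```

In the toy data `name_c` quadruples, but the median is not pulled along by it, which is the robustness against short-term popularity trends mentioned above.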

To my mind this provides strong confirmation that the sickle cell data is correctly interpreted as showing that the percentage of a predominantly African derived immigrant population among the newborns in France has doubled between 2000 and 2015. Confirmation of the growth rate makes it rather unlikely that the absolute percentage numbers are off by any significant degree.

I did these analyses quite some time ago. At one point I became aware that my given name analysis had been scooped by a French far-right website. (Which was one motivation to finally get the blog going.) In their analysis they try to capture all Muslim names and give a definite estimate of the absolute numbers. They handle ambiguous names by just counting them as half a Muslim. According to their analysis the number of Muslim newborns more than doubled between 2000 and 2015.

This got me thinking about how to do this analysis right. Counting ambiguous names as half is a really ugly hack, likely to overcount names as long as Muslims are a minority. Instead one might use the regional and temporal variation to infer for each name separately how it contributes to the number of Muslims.

Once you have done that you can subtract a precise estimate of the number of Muslims from the sickle cell data to get an estimate of the increase of Subsaharan Africans for each region. This allows you to do the same inference for SS-African names, which are probably much more ambiguous than the Islamic ones.

If that works, you have ended up with a method to create precise estimates for both groups directly from given names, even in the likely case that the sickle cell data stops being published. Unfortunately this takes quite a lot of time. And of course there is no guarantee that it would work. Maybe a project for the future.


Four problems with cousin marriage

Cousin marriage was prevalent all over the world with the big exception of western Europe [1]. It still is especially common in the Islamic world. Marrying your relatives has the advantage of keeping the family together. Clans of up to several hundred closely related persons are the result and especially in a pre-state context, that is a pretty useful organisational unit. However, from a genetic or evolutionary perspective there are several potential problems with cousin marriage.

The obvious one is the prevalence of homozygosity runs, i.e. sections in the genome that are identical for the chromosome coming from the father and the chromosome received from the mother. These are generally problematic, because the other chromosome copy has to step in whenever something is significantly messed up in one chromosome. Homozygosity runs mean that mutational load hits with full force for some sections of the genome. There is probably an IQ loss of several points and congenital diseases become much more common.

However, all problems caused by homozygosity runs can be fixed by a single outbreeding event. But what if living in a clan environment has reduced the selection for individual achievement in a population for hundreds of years? The welfare state is often blamed for the reduced or reversed selection for positive traits, but a clan is a form of welfare state. The clan provides you with a job, a wife, takes care of you when you fall ill or lose your ability to feed your kids. It is conceivable that the existence of clan structures prevented the slow replacement of the lower class by the middle class that is conjectured to have raised the IQ in Europe until the nineteenth century [2].

Clan borders also work to a certain degree as genetic barriers. This means that positive mutations have a much harder time sweeping the population. If the default is marrying a relative, a positive mutation will have to sweep each clan separately and additionally jump from clan to clan.

The fourth potential problem I see is a reduced response to selection. Response to selection depends on the variance of the trait in question. Variance within each clan (not necessarily within the full population) will be lower for two reasons: The reduced genetic diversity and less assortative mating. Depending on how the selection pressure is structured this might reduce the speed of genetic adaptation.

It is possible that these four factors played a role in the precipitous fall of intellectual productivity that the Islamic world has experienced since the Islamic Golden Age [3].

[1] Cousin marriage Europe

[2] Farewell to alms

[3] Cousin marriage Middle East

Too much of a good thing

The gay germ theory proposes that there is a pathogen that causes male homosexuality [1]. The reasoning behind this theory is that the fitness hit of homosexuality is too big to allow a genetic explanation. I.e. any genetic variation involved would very quickly be weeded from the gene pool and the occurrence of homosexuality would depend on de novo mutations and be very rare. Pathogens on the other hand, can do with us whatever they want to, because we cannot out-evolve them.

However, there are several indications that an immune reaction of the mother is somehow involved [2]. Here, the idea would be that embryonic tissue that expresses proteins alien to the immune system is attacked and damaged. This would explain an underdevelopment of a hypothalamic nucleus responsible for a male heterosexual orientation, which expresses male specific genes during masculinization and defeminization in utero.

If we combine both ideas, we end up with a putative pathogen that might trigger an immune response against male specific proteins. This gives us a hint how to identify the pathogen in question: It would have to be a pathogen that shares an epitope with such a male specific protein. An epitope is a surface of a folded protein, which is recognized by the immune system’s antibody.

Unfortunately identifying epitopes from protein sequences requires a solution to the protein folding problem. So we are not going to be able to do it just by downloading a bunch of genetic sequences.

However, it is entirely possible that there isn’t actually a pathogen involved. The effect of shared environment on sexual orientation, for example, is zero [3]. Pretty weird for an infection. Instead male homosexuality might be a case of what I like to call the “too much of a good thing”-failure mode of evolution. Occasionally evolution finds itself in dead ends, where there is a selection pressure towards a good thing, and a catastrophic failure mode whenever there is too much of the good thing.

One example of this is the trisomies. Here the good thing is having a big ovum [4]. Human egg cells are pretty big and for good reason. After fertilization they have to divide quickly and set up shop in the uterus. If they run out of gas before the placenta is in place, that’s it. One way ova get big is very unequal division in the last two rounds of cell division. One new cell keeps all the cell plasma, the other one is discarded. Very unequal cell division results in a big final cell, but it also increases the likelihood that not all of the discarded half of the chromosomes can be stashed in the small cell.

Another example might be autism. One of the symptoms of autism is neural overgrowth in some parts of the cortex [5]. And while the average autist does not do too well on an IQ test, extreme precociousness in children is often accompanied by autism. It seems growing too many neurons and learning too fast has catastrophic failure modes too.

This could be going on with male homosexuality. A strong immune system is nice, but not if it attacks vital parts of your unborn child’s brain. Or conversely, toning down the immune system to accommodate your child is a fine thing, if it doesn’t get one or both of you killed. One could argue that evolution should find some sideway avenue to avoid the failure mode. But it didn’t for the trisomies, nor did it for autism.

[1] Gay germ theory

[2] Antibodies against male specific proteins in mothers of gay sons

[3] Shared environment of homosexuality is zero.

[4] The ovum is large.

[5] Brain overgrowth in autism

Counting names

In the last blogpost we counted a lot of names to determine how NE-Asians are represented among songwriting competition winners. In this post we take a closer look at some methods to determine ethnic composition of a dataset by surname analysis.

Without the anatomical baggage, the theory of NE-Asian relative underperformance boils down to “whatever Ashkenazim have that makes them successful (beyond the absolute IQ-value), NE-Asians probably have slightly less of it than Europeans”.

So we are naturally interested in the Ashkenazi representation among songwriting competition winners. We try to calculate it using two datasets of Ashkenazi names [1],[2]. Using the first set of names we count 17% Ashkenazi names among songwriting competition winners; using the second set of names we only find 6.5%. This mostly tells us that we cannot reliably assess the number of Ashkenazim with this method. There are too many names listed that are quite common among gentiles as well. And on the other hand, it is unclear how complete the lists are.

There has been much speculation whether Ashkenazi intellectual performance is declining due to outmarriage and low fertility. Even if our method is too crude to give precise percentages, we can at least say that there is no evidence of declining performance in this dataset.

It seems counting Ashkenazim is methodologically significantly more difficult than counting NE-Asians. And of course the same holds true for African-Americans or Hispanics, who share a lot of surnames with White Americans.

Counting Ashkenazim is a time-honoured method that has been used to advance many different theories. Due to the Ashkenazi IQ advantage of 10 points, Ashkenazi overrepresentation can, for example, be used to assess the intellectual difficulty of a feat. For very high profile samples it is possible to check ethnicity by hand. For the NE-Asian songwriters this was only possible because a look at a picture is enough. Generally this is not possible for Ashkenazim which leads to a lot of potential data fudging. We would really like to do better than that.

How do we accurately and objectively assess the likely number of Ashkenazim in a sample of surnames? Let’s assume we have the probability distribution of Ashkenazi family names, i.e. the frequency with which each name appears in the Ashkenazi population. We also have the surname distribution in the general US population. Then we can create a mixed distribution of x% Ashkenazim and 100-x% non-Ashkenazim and calculate the likelihood of our sample given this mixed distribution. The likelihood is just the product of frequencies of the names in the sample. By doing this for different x we can find the mixed distribution that leads to the maximum likelihood of our sample. The x% used to create this maximum likelihood distribution is our best guess at the Ashkenazi percentage in our sample. Possibly, we have to adjust the percentage by the fraction of the population that is actually covered by our name database.
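The procedure described above can be sketched in a few lines. Everything below is illustrative: the two frequency tables are invented placeholders, the second table stands for the non-group component of the mixture, and the small probability floor for names missing from both tables is my own choice, not part of the method as described.

```python
import math

def log_likelihood(sample, group_freq, other_freq, x, floor=1e-9):
    """Log-likelihood of the sample under a mixture of x group and (1 - x) other."""
    total = 0.0
    for name in sample:
        p = x * group_freq.get(name, 0.0) + (1 - x) * other_freq.get(name, 0.0)
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen names
    return total

def ml_fraction(sample, group_freq, other_freq, steps=1000):
    """Grid search for the mixture fraction x that maximizes the likelihood."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda x: log_likelihood(sample, group_freq, other_freq, x))

# Invented placeholder distributions: one name is shared between both groups.
group_freq = {"shared": 0.2, "group_only": 0.8}
other_freq = {"shared": 0.1, "other_only": 0.9}
sample = ["shared"] * 3 + ["group_only"] * 2 + ["other_only"] * 15

best_x = ml_fraction(sample, group_freq, other_freq)
print(f"maximum likelihood group fraction: {best_x:.3f}")
```

Working with the sum of logarithms instead of the raw product of frequencies gives the same maximum and avoids numerical underflow for large samples. Evaluating `log_likelihood` over the whole grid also gives the curve of plausible percentages, not just its peak.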

It may seem easier to just add the fractions of Ashkenazim for each name in the sample. So if we have ten names and in the general population for each of these names 10% are Ashkenazim, these ten names add up to one full Ashkenazi. Unfortunately, this undervalues overrepresented groups. In a sample with five-fold overrepresentation, these ten names should be treated as 50% Ashkenazi. One idea would be to calculate one estimate, use this estimate to update the expected fraction of Ashkenazim for each name and then recursively get closer to the real value, but the maximum likelihood method also gives us a distribution of likely percentages.
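The recursive idea can be written down as a simple fixed-point iteration. This is one possible formalization, an expectation-maximization update for the mixing fraction, and not necessarily what was meant above. The frequency tables are invented placeholders.

```python
def em_fraction(sample, group_freq, other_freq, start=0.5, iterations=200):
    """Iteratively re-estimate the group fraction in a sample of names.

    Each round computes, for every name, the probability that its bearer
    belongs to the group given the current fraction x, then sets x to the
    average of those probabilities.
    """
    x = start
    for _ in range(iterations):
        weights = []
        for name in sample:
            from_group = x * group_freq.get(name, 0.0)
            from_other = (1 - x) * other_freq.get(name, 0.0)
            total = from_group + from_other
            # Names unknown to both tables carry no information here;
            # this sketch simply counts them as non-group.
            weights.append(from_group / total if total > 0 else 0.0)
        x = sum(weights) / len(weights)
    return x

# Invented placeholder distributions: one name is shared between both groups.
group_freq = {"shared": 0.2, "group_only": 0.8}
other_freq = {"shared": 0.1, "other_only": 0.9}
sample = ["shared"] * 3 + ["group_only"] * 2 + ["other_only"] * 15

estimate = em_fraction(sample, group_freq, other_freq)
print(f"iterated estimate of group fraction: {estimate:.4f}")
```

For known name distributions this iteration settles on the fraction that maximizes the likelihood, so it yields a point estimate but not the spread of likely percentages.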

The 2010 US census provides the surname distributions for the general US population, Whites, Blacks, Asians and Hispanics [3]. This allows us to test this method at least for these groups. For the songwriting competition winners it results in a maximum likelihood for an Asian percentage of 0.8%. This would correspond to 23 artists. Given that we only looked at NE-Asians in the last blogpost, this result is very compatible with our earlier result.

Likelihood for percentage of non-Asians peaks at 99.2%.

So we do have a method and we do have a motive, but unfortunately we don’t have a distribution of Ashkenazi surnames in the US. One way to get a distribution of Ashkenazi surnames would be by scraping the names of Holocaust victims from the Yad Vashem database [4]. However, that is certainly somewhat disrespectful and it is unclear whether the distribution is all that similar to the current American one. The other way is to collect names of American Jews from a variety of sources, here a hundred and there a hundred, until a significant fraction of the most common names is covered. However, I am currently too lazy to do that. But maybe I will get around to it, if something a lot more interesting than songwriting competition winners turns up.

[1] Ashkenazi names 1

[2] Ashkenazi names 2

[3] 2010 US Census

[4] Holocaust victim names

Verbal IQ and songwriting – NE-Asian underperformance

In my two-part blog post “A theory of intelligence” I examine the unusual IQ profiles of both Ashkenazi Jews (high verbal) and NE-Asians (high math-spatial) to propose a theory of intelligence. This theory tries to explain the NE-Asian underperformance in GDP and science relative to their very high IQ, by positing that NE-Asians create fewer lateral and top-down synapses. This leads to slightly lower verbal IQ and conceptual creativity compared to Europeans and especially compared to Ashkenazim.

One of my intuitions is that verbal IQ tests do not pick up on this difference particularly well, because they also load on knowledge and pattern recognition. I wondered whether tail effects in verbally creative endeavors would maybe lend support to my theory. To this end I analyzed a dataset of songwriting competition winners [1].

The dataset consists of 2875 US artists that won prizes or honorary mentions in the years 2002-2017. To identify NE-Asian artists I compare the names against the most common Korean [2], Japanese [3] and Chinese [4] surnames. These surnames cover roughly 90%, 33% and 84.8% of these populations respectively. Each hit I then check by hand to exclude anybody who is provably not Asian (quite likely for some names like Young, Shaw or Lee).

Chinese Americans constitute 1.5% of the US population, Japanese Americans 0.4% and Korean Americans 0.8%. Multiplied by the sensitivity of our method, this leads to (0.015 × 0.848 + 0.004 × 0.33 + 0.008 × 0.90) × 2875 ≈ 61 being the expected number of hits for a perfectly proportional representation of NE-Asian Americans.
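The expected number of hits, written out as a short sketch using only the population shares and surname coverages quoted above:

```python
# (population share, surname list coverage) as quoted in the text.
groups = {
    "Chinese": (0.015, 0.848),
    "Japanese": (0.004, 0.33),
    "Korean": (0.008, 0.90),
}
sample_size = 2875

# Expected fraction of the sample that the surname lists would flag.
expected_fraction = sum(share * coverage for share, coverage in groups.values())
expected_hits = sample_size * expected_fraction

print(f"expected hits under proportional representation: {expected_hits:.1f}")
```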

Instead we only find 13 NE-Asian names that cannot be excluded, more than a four-fold underrepresentation. Of course, one may argue that this is a result of language deficiencies due to relatively recent immigration. However, there is also no upward trend visible over these 15 years. Only seven of these artists are unambiguously Japanese, Chinese or Korean American. Of the rest, one is Japanese but not American, one is Malaysian, one is Taiwanese (not included in the 1.5%) and four I could not identify.

Also, perfectly proportional representation may be the wrong baseline to compare against. Of the 1554 winners of the US Open Music Competition 2019 [5], a competition in classical music, 1050 have NE-Asian surnames. These also have much more typical names, with the most common being Yang, Wang, Chen, Li, Truong, Zhang, Liu, Loo, Wu, Lin. This makes it plausible that we are still overcounting NE-Asians in the songwriting dataset.

There is the additional fudge factor that you can’t tell who wrote the lyrics. Christopher Tin, for example, whom we counted twice, is a classical composer. His most famous piece is Baba Yetu, the theme song of Civilization IV. Its lyrics are a Swahili version of the Lord’s Prayer.

Overall we see at least a 4.5-fold underrepresentation relative to population percentage, and a 150-fold underrepresentation compared to the classical music winners. I chalk this up as consistent with my theory.
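For transparency, here is how the two ratios follow from the counts given earlier in this post. The figure of 61 expected hits is the rounded value from the calculation further up.

```python
# Counts quoted earlier in the post.
songwriting_hits, songwriting_total = 13, 2875
classical_hits, classical_total = 1050, 1554
expected_hits = 61  # rounded expectation under proportional representation

# Underrepresentation relative to population share.
fold_vs_population = expected_hits / songwriting_hits

# Underrepresentation relative to the classical music competition.
songwriting_share = songwriting_hits / songwriting_total
classical_share = classical_hits / classical_total
fold_vs_classical = classical_share / songwriting_share

print(f"relative to population share: {fold_vs_population:.1f}-fold")
print(f"relative to classical winners: {fold_vs_classical:.0f}-fold")
```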

[1] International songwriting competition

[2] Korean Surnames

[3] Japanese surnames

[4] Chinese surnames

[5] US Open Music Competition Winners