Daniel Lemire's blog – Telegram

Daniel Lemire's blog

408 subscribers

70 photos

342 links

This channel can be used to follow Daniel Lemire's blog. It is COVID-free. If you would like news on COVID, follow https://xn--r1a.website/covidinfoenglish

Download Telegram

About

Blog

Apps

Platform

Daniel Lemire's blog

408 subscribers

Daniel Lemire's blog

Converting integers to fix-digit representations quickly

It is tricky to convert integers into strings because the number of characters can vary according to the amplitude of the integer. The integer ‘1’ requires a single character whereas the integer ‘100’ requires three characters. So a solution might possible need a hard-to-predict branch. Let us simplify the problem. Imagine that you want to serialize integers to fixed-digit strings. Thus you may want to convert 16-digit integers (up to 10000000000000000) to exactly 16 digits, including leading zeros if needed. In this manner, it is easy to write code that contains only trivial branches. The simplest approach could be a character-by-character routine where I use the fact that the character ‘0’ in ASCII is just 0x30 (in hexadecimal): void to_string_backlinear(uint64_t x, char *out) { for(int z = 0; z < 16; z++) { out[15-z] = (x % 10) + 0x30; x /= 10; } } It is somewhat strange to write the characters backward, starting from the less significant digit. You can try to go forward, but it is a bit trickier. Here is one ugly approach that is probably…

https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/

354 views03:57

Daniel Lemire's blog

Are tenured professors more likely to speak freely?

University professors often have robust job security after a time: they receive tenure. It means that they usually do not have to worry about applying for a new job after a few years. Tenure is not available in all countries. Countries like Australia reassess positions every few years. So why does it exist where it does? One of the justifications for tenure is that professors who have tenure can speak more freely. Thus, in theory, they can be critical of government or corporate policies. Do they? What would “speaking freely” entails? What about denouncing a colleague who commits blatant fraud? On this front, the evidence is not great. Diederik Stapel published well over 100 research papers in prestigious journals. He was fired when it was determined that he was making up all of his research data. It took outsiders (students) to report him. Harvard professor Marc Hauser published over 200 papers in the best journals, making up data as he went. It took naive students to report the fraud. We find too many examples of over fraud in science, and rarely do we find…

https://lemire.me/blog/2021/11/27/are-tenured-professors-more-likely-to-speak-freely/

Daniel Lemire's blog

Are tenured professors more likely to speak freely?

University professors often have robust job security after a time: they receive tenure. It means that they usually do not have to worry about applying for a new job after a few years. Tenure is not available in all countries. Countries like Australia reassess…

301 views01:49

Daniel Lemire's blog

Science and Technology links (Novembre 28th 2021)

Government-funded research is getting more political and less diverse: The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources. The least are Mathematical & Physical Sciences and Computer & Information Science & Engineering, although even they are significantly more politicized than any field was in 1990. At the same time, abstracts have been becoming more similar to each other over time. Taken together, the results imply that there has been a politicization of scientific funding in the US in recent years and a decrease in the diversity of ideas supported. Parabiosis is the process of tying the blood vessels of animals. Zhang et al. proceeded with parabiosis between young and old mice, followed by a detachment period. The old mice lived longer than control mice and they appear to have been rejuvenated by the parabiosis. A single injection to enable paralyzed mice to walk again. The Artic has been warming for much longer than we thought. Dog fed only once daily are healthier. India fertility…

https://lemire.me/blog/2021/11/28/science-and-technology-links-novembre-28th-2021/

Daniel Lemire's blog

Science and Technology links (Novembre 28th 2021)

Government-funded research is getting more political and less diverse: The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources.…

308 views18:17

Daniel Lemire's blog

Science and Technology links (Novembre 28th 2021)

Government-funded research is getting more political and less diverse: The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources. The least are Mathematical & Physical Sciences and Computer & Information Science & Engineering, although even they are significantly more politicized than any field was in 1990. At the same time, abstracts have been becoming more similar to each other over time. Taken together, the results imply that there has been a politicization of scientific funding in the US in recent years and a decrease in the diversity of ideas supported. Parabiosis is the process of tying the blood vessels of animals. Zhang et al. proceeded with parabiosis between young and old mice, followed by a detachment period. The old mice lived longer than control mice and they appear to have been rejuvenated by the parabiosis. A single injection to enable paralyzed mice to walk again. The Artic has been warming for much longer than we thought. Dog fed only once daily are healthier. India fertility…

https://lemire.me/blog/2021/11/28/science-and-technology-links-novembre-28th-2021/

Daniel Lemire's blog

Science and Technology links (Novembre 28th 2021)

Government-funded research is getting more political and less diverse: The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources.…

321 views18:17

Daniel Lemire's blog

Can you safely parse a double when you need a float?

In C as well as many other programming languages, we have 32-bit and 64-bit floating-point numbers. They are often referred to as float and double. Most of systems today follow the IEEE 754 standard which means that you can get consistent results across programming languages and operating systems. Hence, it does not matter very much if you implement your software in C++ under Linux whereas someone else implements it in C# under Windows: if you both have recent systems, you can expect identical numerical outcomes. When you are reading these numbers from a string, there are distinct functions. In C, you have strtof and strtod. One parses a string to a float and the other function parses it to a double. At a glance, it seems redundant. Why not just parse your string to a double value and cast it back to a float, if needed? Of course, that would be slightly more expensive. But, importantly, it is also gives incorrect results in the sense that it is not equivalent to parsing directly to a float. In other words, these functions…

https://lemire.me/blog/2021/11/30/can-you-safely-parse-a-double-and-then-cast-to-a-float/

Daniel Lemire's blog

Can you safely parse a double when you need a float?

In C as well as many other programming languages, we have 32-bit and 64-bit floating-point numbers. They are often referred to as float and double. Most of systems today follow the IEEE 754 standard which means that you can get consistent results across programming…

364 views23:26

Daniel Lemire's blog

Science and Technology links (December 4th 2021)

It used to be that all the exciting new processors came from Intel and AMD, and they were meant for your PC. The mobile revolution changed that: it lead to the production of fantastic processors that used little energy. We are now moving back into laptops and servers. The leading supplier of servers is probably Amazon: if you want to run servers for your business, your first choice is often Amazon and its cloud computing infrastructure. And Amazon now makes their own processors. Initially, they made low-cost chips that were secondary to Intel and AMD powerful processors. Amazon released its latest processor, Graviton 3. It is a much wider and deeper chip: The N1 core used in the Graviton2 chip had an instruction fetch unit that was 4 to 8 instructions wide and 4 wide instruction decoder that fed into an 8 wide issue unit that included two SIMD units, two load/store units, three arithmetic units, and a branch unit. With the Perseus N2 core used in the Graviton3, there is an 8 wide fetch unit that feeds into a…

https://lemire.me/blog/2021/12/04/science-and-technology-links-december-4th-2021/

1.1K views16:19

Daniel Lemire's blog

Science and Technology links (December 19th 2021)

Becoming a physician increases the use of antidepressants, opioids, anxiolytics, and sedatives, especially for female physicians. When trying to reproduce results in cancer researchers, independent researchers find that the benefits are typically grossly exaggerated. A large planet has been found orbiting two suns. It is orbiting from very far away. NASA sent a probe in the atmosphere of the sun. As you age, you accumulate old (senescent) cells. These cells are dysfunctional and in sufficient quantity might cause harm to your body such as chronic inflammation. Researchers found a plant-derived compound (procyanidin C1 or PCC1) which can not only limit the harm due to senescent cells but also eliminate some of them.

https://lemire.me/blog/2021/12/19/science-and-technology-links-december-19th-2021/

Daniel Lemire's blog

Science and Technology links (December 19th 2021)

Becoming a physician increases the use of antidepressants, opioids, anxiolytics, and sedatives, especially for female physicians. When trying to reproduce results in cancer researchers, independent researchers find that the benefits are typically grossly…

366 views21:01

Daniel Lemire's blog

How programmers make sure that their software is correct

Our most important goal in writing software is that it be correct. The software must do what the programmer wants it to do. It must meet the needs of the user. In the business world, double-entry bookkeeping is the idea that transactions are recorded in at least two accounts (debit and credit). One of the advantages of double-entry bookkeeping, compared to a more naive approach, is that it allows for some degree of auditing and error finding. If we compare accounting and software programming, we could say that double-entry accounting and its subsequent auditing is equivalent to software testing. For an accountant, converting a naive accounting system into a double-entry system is a difficult task in general. In many cases, one would have to reconstruct it from scratch. In the same manner, it can be difficult to add tests to a large application that has been developed entirely without testing. And that is why testing should be first on your mind when building serious software. A hurried or novice programmer can write quickly a routine, compile and run it and…

https://lemire.me/blog/2022/01/03/how-programmers-make-sure-that-their-software-is-correct/

368 views23:47

Daniel Lemire's blog

What is the ‘range’ of a number type?

In programming, we often represent numbers using types that have specific ranges. For example, 64-bit signed integer types can represent all integers between -9223372036854775808 and 9223372036854775807, inclusively. All integers inside this range are valid, all integers outside are “out of range”. It is simple. What about floating-point numbers? The nuance with floating-point numbers is that they cannot represent all numbers within a continuous range. For example, the real number 1/3 cannot be represented using binary floating-point numbers. So the convention is that given a textual representation, say “1.1e100”, we seek the closest approximation. Still, are there ranges of numbers that you should not represent using floating-point numbers? That is, are there numbers that you should reject? It seems that there are two different interpretation: My own interpretation is that floating-point types can represent all numbers from -infinity to infinity, inclusively. It means that ‘infinity’ or 1e9999 are indeed “in range”. For 64-bit IEEE floating-point numbers, this means that numbers smaller than 4.94e-324 but greater than 0 can be represented as 0, and that numbers greater than 1.8e308 should be infinity.…

https://lemire.me/blog/2022/01/17/what-is-the-range-of-a-number-type/

Daniel Lemire's blog

What is the ‘range’ of a number type?

In programming, we often represent numbers using types that have specific ranges. For example, 64-bit signed integer types can represent all integers between -9223372036854775808 and 9223372036854775807, inclusively. All integers inside this range are valid…

393 views21:53

Daniel Lemire's blog

SWAR explained: parsing eight digits

It is common to want to parse long strings of digits into integer values. Because it is a common task, we want to optimize it as much as possible. In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integers (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post. To recap, the long way is a simple loop: uint32_t parse_eight_digits(const unsigned char *chars) { uint32_t x = chars[0] - '0'; for (size_t j = 1; j < 8; j++) x = x * 10 + (chars[j] - '0'); return x; } We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step…

https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/

402 views22:52

Daniel Lemire's blog

The end of the monopolistic web era?

Except maybe in totalitarian states, you cannot ever have a single publisher. Most large cities had multiple independent newspapers. In recent years, we saw a surge of concentration regarding newspaper and television ownership. However, this was accompanied by a surge of online journalism. The total number of publishers increased, if nothing else. Yet you can more easily have a single carrier/distributor than monopolistic publisher. For example, the same delivery service provides me my newspaper as well as a range of competing newspapers. The delivery man does not much care for the content of my newspaper. A few concentrated Internet providers support diverse competing services. The current giants (Facebook, Twitter and Google) were built initially as neutral distributors. Google was meant to give you access to all of the web’s information. If the search engine is neutral, there is no reason to have more than one. If twitter welcomes everyone, then there is no reason to have competing services. Newspapers have fact-checking services, but newspaper delivery services do not. Of course, countries like Russia and China often had competing services, but…

https://lemire.me/blog/2022/02/07/the-end-of-the-monopolistic-web-era/

1.2K views17:40

Daniel Lemire's blog

The Canadian Common CV and the captured academy

Most Canadian academics have to write their resumes using a government online tool called the Common CV. When it was first introduced, it was described as a time-saving tool: instead of writing your resume multiple times for different grant agencies, you would write it just once and be done with it. In practice, it turned into something of a nightmare for many of us. You have to adapt it each time for each grant application, sometimes in convoluted ways, using a clunky interface. What the Common CV does do is provide much power to middle-managers who can insert various bureaucratic requirements. You have to use their tool, and they can tailor it administratively without your consent. It is part of an ongoing technocratic invasion. How did Canadian academics react? Did the revolt? Not at all. In fact, they are embracing it. I recently had to formally submit my resume as part of a routine review process, they asked for my Common CV. That is, instead of fighting against the techno-bureaucratic process, they extend its application to every aspect of their lives. And…

https://lemire.me/blog/2022/02/18/the-canadian-common-cv-and-the-captured-academy/

Daniel Lemire's blog

The Canadian Common CV and the captured academy

Most Canadian academics have to write their resumes using a government online tool called the Common CV. When it was first introduced, it was described as a time-saving tool: instead of writing your resume multiple times for different grant agencies, you…

365 views23:36

Daniel Lemire's blog

Enforcement by software

At my university, one of our internal software systems allows a professor to submit a revision to a course. The professor might change the content or the objectives of the course. In a university, professors have extensive freedom regarding course content. As long as you reasonably meet the course objectives, you can do whatever you like. You can pick the textbook you prefer, write your own and so forth. However, the staff that built our course revision system decided to make it so that every single change should go through all layers of approvals. So if I want to change the title of an assignment, according to this tool, I need the department to approve. When I first encountered this new tool, I immediately started to work around it. And because I am department chair, I brought my colleagues along for the ride. So we ‘pretend’ to get approval, submitting fake documents. The good thing in a bureaucracy is that most people are too bored to check up on the fine prints. Surprisingly, it appears that no other department has…

https://lemire.me/blog/2022/02/24/enforcement-by-software/

Daniel Lemire's blog

Enforcement by software

At my university, one of our internal software systems allows a professor to submit a revision to a course. The professor might change the content or the objectives of the course. In a university, professors have extensive freedom regarding course content.…

350 views01:55

Daniel Lemire's blog

Enforcement by software

At my university, one of our internal software systems allows a professor to submit a revision to a course. The professor might change the content or the objectives of the course. In a university, professors have extensive freedom regarding course content. As long as you reasonably meet the course objectives, you can do whatever you like. You can pick the textbook you prefer, write your own and so forth. However, the staff that built our course revision system decided to make it so that every single change should go through all layers of approvals. So if I want to change the title of an assignment, according to this tool, I need the department to approve. When I first encountered this new tool, I immediately started to work around it. And because I am department chair, I brought my colleagues along for the ride. So we ‘pretend’ to get approval, submitting fake documents. The good thing in a bureaucracy is that most people are too bored to check up on the fine prints. Surprisingly, it appears that no other department has…

https://lemire.me/blog/2022/02/24/enforcement-by-software/

Daniel Lemire's blog

Enforcement by software

At my university, one of our internal software systems allows a professor to submit a revision to a course. The professor might change the content or the objectives of the course. In a university, professors have extensive freedom regarding course content.…

411 views01:55

Daniel Lemire's blog

Writing out large arrays in Go: binary.Write is inefficient for large arrays

Programmers often need to write data structures to disk or to networks. The data structure then needs to be interpreted as a sequence of bytes. Regarding integer values, most computer systems adopt “little endian” encoding whereas an 8-byte integer is written out using the least significant bytes first. In the Go programming language, you can write an array of integers to a buffer as follows: var data []uint64 var buf *bytes.Buffer = new(bytes.Buffer) ... err := binary.Write(buf, binary.LittleEndian, data) Until recently, I assumed that the binary.Write function did not allocate memory. Unfortunately, it does. The function converts the input array to a new, temporary byte arrays. Instead, you can create a small buffer just big enough to hold you 8-byte integer and write that small buffer repeatedly: var item = make([]byte, 8) for _, x := range data { binary.LittleEndian.PutUint64(item, x) buf.Write(item) } Sadly, this might have poor performance on disks or networks where each write/read has a high overhead. To avoid this problem, you can use Go’s buffered writer and write the integers one by one. Internally, Go will allocate…

https://lemire.me/blog/2022/03/18/writing-out-large-arrays-in-go-binary-write-is-inefficient-for-large-arrays/

382 views22:25

Daniel Lemire's blog

Converting integers to decimal strings faster with AVX-512

In most systems, integers are stored using a fixed binary representation. It is common to store integers using 32-bit or 64-bit words. You sometimes need to convert it to a string. For example, the integer 12345 might need to be converted to the five characters ‘12345’. In an earlier blog post, I presented the simpler problem of converting integers to fixed-digit strings, using exactly 16-characters with leading zeroes as needed. For example, the integer 12345 becomes the string ‘0000000000012345’. For this problem, the most practical approach might be a tree-based version with a small table. The core idea is to start from the integer, compute an integer representing the 8 most significant decimal digits, and another integer representing the least significant 8 decimal digit. Then we repeat, dividing the two eight-digit integers into two four-digit integers, and so forth until we get to two-digit integers in which case we use a small table to convert them to a decimal representation. The code in C++ might look as follows: void to_string_tree_table(uint64_t x, char *out) { static const char table[200] = {…

https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/

298 views21:51

Daniel Lemire's blog

Converting integers to decimal strings faster with AVX-512

In most systems, integers are stored using a fixed binary representation. It is common to store integers using 32-bit or 64-bit words. You sometimes need to convert it to a string. For example, the integer 12345 might need to be converted to the five characters ‘12345’. In an earlier blog post, I presented the simpler problem of converting integers to fixed-digit strings, using exactly 16-characters with leading zeroes as needed. For example, the integer 12345 becomes the string ‘0000000000012345’. For this problem, the most practical approach might be a tree-based version with a small table. The core idea is to start from the integer, compute an integer representing the 8 most significant decimal digits, and another integer representing the least significant 8 decimal digit. Then we repeat, dividing the two eight-digit integers into two four-digit integers, and so forth until we get to two-digit integers in which case we use a small table to convert them to a decimal representation. The code in C++ might look as follows: void to_string_tree_table(uint64_t x, char *out) { static const char table[200] = {…

https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/

319 views21:51

Daniel Lemire's blog

Converting integers to decimal strings faster with AVX-512

In most systems, integers are stored using a fixed binary representation. It is common to store integers using 32-bit or 64-bit words. You sometimes need to convert it to a string. For example, the integer 12345 might need to be converted to the five characters ‘12345’. In an earlier blog post, I presented the simpler problem of converting integers to fixed-digit strings, using exactly 16-characters with leading zeroes as needed. For example, the integer 12345 becomes the string ‘0000000000012345’. For this problem, the most practical approach might be a tree-based version with a small table. The core idea is to start from the integer, compute an integer representing the 8 most significant decimal digits, and another integer representing the least significant 8 decimal digit. Then we repeat, dividing the two eight-digit integers into two four-digit integers, and so forth until we get to two-digit integers in which case we use a small table to convert them to a decimal representation. The code in C++ might look as follows: void to_string_tree_table(uint64_t x, char *out) { static const char table[200] = {…

https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/

369 views21:55

Daniel Lemire's blog

String representations are not unique: learn to normalize!

Most strings in software today are represented using the unicode standard. The unicode standard can represent most human readable strings. Unicode works by representing each ‘character’ as a numerical value (called a code point) between 0 and 1 114 112. Thus the character é is typically represented as the numerical value 233 (or 0xe9 in hexadecimal). Thus in Python, JavaScript and many other programming languages, you get the following: >>> "u00e9" 'é' Unfortunately, unicode does not ensure that there is a unique way to achieve every visual character. For example, you can combine the letter ‘e’ (code point 0x65) with ‘acute accent’ (code point 0x0301): >>> "u0065u0301" 'é' Unfortunately, in most programming languages, these strings will not be considered to be the same even though they look the same to us: >>> "u0065u0301"=="u00e9" False For obvious reason, it can be a problem within a computer system. What if you are doing some search in a database for name with the character ‘é’ in it? The standard solution is to normalize your strings. In effect, you transform them so that strings…

https://lemire.me/blog/2022/04/05/string-representations-are-not-unique-learn-to-normalize/

Daniel Lemire's blog

String representations are not unique: learn to normalize!

Most strings in software today are represented using the unicode standard. The unicode standard can represent most human readable strings. Unicode works by representing each 'character' as a numerical value (called a code point) between 0 and 1 114 112. …

343 views15:50

Daniel Lemire's blog

Floats have 15-digit accuracy in their normal range

In programming languages like JavaScript or Python, numbers are typically represented using 64-bit IEEE number types (binary64). For these numbers, we have 15 digits of accuracy. It means that you can pick a 15-digit number, such as 1.23456789012345e100 and it can be represented exactly: there exists a floating-point number that has exactly these 15-most significant digits. In this particular case, it is the number 6355009312518497 * 2280. Obviously, it fails for numbers that outside the valid range. For example, the number 1e500 is too large and cannot be directly represented using standard 64-bit floating-point numbers. Similarly, 1e-500 is too small and it can only be represented as zero. The range of 64-bit floating-point number might be defined as going from 4.94e-324 to 1.8e308 and -1.8e308 to -4.94e-324, together with exactly 0. However, this range includes subnormal numbers where the relative accuracy can be small. For example, the number 5.00000000000000e-324 is best represented as 4.94065645841247e-324, meaning that we have zero-digit accuracy. For the 15-digit accuracy rule to work, you might remain in the normal range, e.g., from 2.225e−308 to 1.8e308 and…

https://lemire.me/blog/2022/04/13/floats-have-15-digit-accuracy-in-their-normal-range/

Daniel Lemire's blog

Floats have 15-digit accuracy in their normal range

In programming languages like JavaScript or Python, numbers are typically represented using 64-bit IEEE number types (binary64). For these numbers, we have 15 digits of accuracy. It means that you can pick a 15-digit number, such as 1.23456789012345e100 and…

301 views17:31

Daniel Lemire's blog

Floats have 15-digit accuracy in their normal range

In programming languages like JavaScript or Python, numbers are typically represented using 64-bit IEEE number types (binary64). For these numbers, we have 15 digits of accuracy. It means that you can pick a 15-digit number, such as 1.23456789012345e100 and it can be represented exactly: there exists a floating-point number that has exactly these 15-most significant digits. In this particular case, it is the number 6355009312518497 * 2280. Having 15-digit of accuracy is excellent: few engineering processes can ever exceed this accuracy. This 15-digit accuracy fails for numbers that outside the valid range. For example, the number 1e500 is too large and cannot be directly represented using standard 64-bit floating-point numbers. Similarly, 1e-500 is too small and it can only be represented as zero. The range of 64-bit floating-point number might be defined as going from 4.94e-324 to 1.8e308 and -1.8e308 to -4.94e-324, together with exactly 0. However, this range includes subnormal numbers where the relative accuracy can be small. For example, the number 5.00000000000000e-324 is best represented as 4.94065645841247e-324, meaning that we have zero-digit accuracy. For the 15-digit accuracy rule…

https://lemire.me/blog/2022/04/13/floats-have-15-digit-accuracy-in-their-normal-range/

Daniel Lemire's blog

Floats have 15-digit accuracy in their normal range

In programming languages like JavaScript or Python, numbers are typically represented using 64-bit IEEE number types (binary64). For these numbers, we have 15 digits of accuracy. It means that you can pick a 15-digit number, such as 1.23456789012345e100 and…

388 views18:17