Airflow 2.0
At the end of 2020, the major release of Apache Airflow 2.0 came out. Here are the most important things it brought with it:
1. New task scheduler. The Airflow Scheduler used to be a single point of failure and was not horizontally scalable. That has now changed: you can run multiple Schedulers, so Airflow should keep running data pipelines without a hiccup even when a node failure takes down one of them. Cool, right?
2. RESTful API. We can now do CRUD operations programmatically through the API! Plus, the updated security model: all REST operations now require authorization (a quick sketch follows this list).
3. The SubDAG operator was superseded by Task Groups.
4. Smart sensors. Although sensors sit idle for most of their execution time, each one holds a "slot" that consumes CPU and memory. Smart Sensors reduce the number of those slots during high-load periods.
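A minimal sketch of point 2, calling the stable REST API with the requests library. The host, credentials, and dag_id are placeholders, and basic auth assumes the basic_auth backend is enabled in airflow.cfg:
import requests

BASE_URL = "http://localhost:8080/api/v1"  # hypothetical Airflow webserver
AUTH = ("admin", "admin")                  # placeholder credentials

# Read: list the DAGs registered in this Airflow instance
dags = requests.get(f"{BASE_URL}/dags", auth=AUTH).json()
for dag in dags.get("dags", []):
    print(dag["dag_id"], "is_paused:", dag["is_paused"])

# Create: trigger a new run of a hypothetical DAG
resp = requests.post(
    f"{BASE_URL}/dags/example_dag/dagRuns",
    auth=AUTH,
    json={"conf": {}},
)
print(resp.status_code, resp.json())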
Source
#big_data
Apache Airflow
Apache Airflow 2.0 is here!
We're proud to announce that Apache Airflow 2.0.0 has been released.
How the LinkedIn engineering team moved some of their pipelines from a Lambda architecture to a Kappa architecture. Check it out
They faced the same problems I described here; sadly, they didn't describe all of them in the post.
Linkedin
From Lambda to Lambda-less: Lessons learned
Co-authors: Xiang Zhang and Jingyu Zhu
Just a terrific lightning talk about interesting features of two not-so-famous languages.
I screamed with laughter 😆
Another area of knowledge, one that sits on a slightly different plane but is directly related to the data: management challenges tackle privacy, security, governance, and data/metadata management.
https://luminousmen.com/post/management-challenges-in-big-data
Blog | iamluminousmen
Management Challenges in Big Data
Management challenges tackle privacy, security, governance, and data/metadata management. Another area of knowledge directly connected to the big data
Snowflake
In my opinion, traditional storage methods and technologies face great challenges in delivering the service, simplicity, and value that a rapidly changing business needs. In addition to removing these limitations, Snowflake provided significant performance benefits, an easy and intuitive way for administrators and users to interact with the system, and, finally, a way to scale to levels of concurrency that would cost a huge pile of money in an MPP architecture and are simply impossible with traditional approaches.
Snowflake (or SnowflakeDB) is a cloud-based SaaS database for analytical workloads and batch data processing, typically used to build a data warehouse in the cloud. It supports ACID transactions, different isolation levels, read consistency, and multi-version concurrency control. Its main features are storage decoupled from compute and a cloud-native focus (a minimal connection sketch follows the list below).
There's been a lot of news about Snowflake lately, and everyone is covering how much of a silver bullet it can be. While I like the technology side of things, it also has its downsides. Going through the article, here's a summary for you:
▪️it's a proprietary SaaS solution: you can't solve any problem except through the company's support team
▪️it's a cloud-native solution; you can't deploy it on-premises, so you don't really "own" the data
▪️it's an opinionated solution: you can't optimize it much, as there aren't many knobs you can turn
▪️it's expensive: depending on the use case, Snowflake may cost more than competitors like Amazon Redshift or traditional MPP solutions like Vertica
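For completeness, here's a minimal sketch of talking to Snowflake from Python with the snowflake-connector-python package; the account identifier, credentials, warehouse, and table names are all hypothetical:
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST",                # placeholder credentials
    password="***",
    account="xy12345.eu-west-1",   # hypothetical account identifier
    warehouse="REPORTING_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Compute is decoupled from storage: resize the virtual warehouse on the fly,
# then query the same tables with more horsepower.
cur.execute("ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE'")
cur.execute("SELECT event_date, COUNT(*) FROM events GROUP BY event_date")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()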
#big_data
Distributed Systems Architecture
Snowflake: The Good, The Bad and The Ugly
Snowflake or SnowflakeDB is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people are getting mad at praising it all around…
When you're choosing a base image for your Docker image, Alpine Linux is often recommended. Using Alpine, you're told, will make your images smaller and speed up your builds. But for Python, Alpine builds are vastly slower and the images end up bigger.
I've faced the same issues with a custom Airflow/Python container as the author describes here, and as a solution I switched to the slim Docker images. Doing fine so far :)
#python
Python⇒Speed
Using Alpine can make Python Docker builds 50× slower
Alpine Linux is often recommended as a smaller, faster Docker base image. But if you’re using Python, it will slow down your build and make your image larger.
JetBrains published the results of its Python Developers Survey for 2020, which questioned over 28,000 Python developers and enthusiasts from nearly 200 countries/regions in October 2020. Here are some of the notable insights:
⚡️ JavaScript is the most popular language for developers to use with Python, particularly for web developers. As the survey notes, with HTML/CSS, Bash/Shell, and SQL, "they create a stack of languages where 2 out of every 5 Python devs are using at least one of them."
⚡️ JavaScript and C/C++ are the most common main languages for those who use Python as a secondary language.
⚡️ 55% surveyed use Python for data analysis, making it the top use case, with 50% using it for web development.
⚡️ 94% use Python 3, with 6% using Python 2. 44% are using Python 3.8.
⚡️ Flask is the most popular web framework (46%), followed by Django at 43% and FastAPI at 12%.
⚡️ Most Pythonistas who use Flask prefer SQLAlchemy, while Django users use Django ORM.
⚡️ PostgreSQL is the most popular database amongst Python developers (45%).
⚡️ AWS is the top cloud platform (53%), followed by Google Cloud (33%).
⚡️ Linux is the most popular operating system (68%), followed by Windows (48%).
Check out the full report for more insights.
#python
JetBrains: Developer Tools for Professionals and Teams
Python Developers Survey 2020 Results
Official Python Developers Survey 2020 Results by Python Software Foundation and JetBrains: more than 28k responses from more than 150 countries.
I want to share this seminar series which I've been following for a while now.
Stanford created an MLSys seminar series taking a look at the frontier of machine learning systems and how machine learning changes the modern programming stack. They have a lot of rockstar speakers and interesting conversations! Recommended 👌
https://mlsys.stanford.edu/
#ml
mlsys.stanford.edu
Stanford MLSys Seminar
Seminar series on the frontier of machine learning and systems.
Pattern matching
For a long time, the Python community has put forward various proposals for multi-branch conditionals in the language (similar to the switch statement in C/C++), but none of them was ever accepted. Over the past year or so, the community discussed a proposal that might solve the multi-branch problem (and more) and adopted the pattern matching suggestions, specifically PEP 634, 635, and 636. It will be a feature of the upcoming Python 3.10 release.
Examples are as follows:
match command.split():
    case ["quit"]:
        print("Goodbye!")
        quit_game()
    case ["get", obj]:
        character.get(obj, current_room)
    case ["go", direction]:
        current_room = current_room.neighbor(direction)
    case _:
        print(f"Sorry, I couldn't understand {command!r}")
For me it looks like some Scala-like syntax, which is not bad.
#python
Python Enhancement Proposals (PEPs)
PEP 634 – Structural Pattern Matching: Specification | peps.python.org
This PEP provides the technical specification for the match statement. It replaces PEP 622, which is hereby split in three parts:
Hey everybody,
The channel is growing, so let me collect all the interesting posts from this channel that I think should get more attention:
#dev
⚡️Technological degradation
⚡️Define "production-ready"
⚡️Abstraction is not OOP
#python
⚡️Use pathlib
⚡️Pip constraints files
⚡️.pth files
#soft_skills
⚡️Your top skill
⚡️Ask stupid questions
⚡️Let robots yell at people
⚡️Soft skills thoughts
#big_data
⚡️S3 vs HDFS
⚡️Famous in-memory data format
⚡️Complexity in distributed systems
⚡️Snowflake
#ml
⚡️ML system basic framework
⚡️AutoML
⚡️MLOps
⚡️Testing and validation in ML
Bigger posts I'm sharing on my small website ✌️
Perhaps not every Python developer knows this interesting property of CPython that makes newcomers go crazy:
>>> a = 255
>>> b = 255
>>> a == b
True
>>> a is b
True
The double equal operator checks that the objects' values are equal, and the is operator checks that the variables refer to the same object. You would expect a and b to be different objects, so a is b should return False.
There is actually an optimization in Python regarding small integers (-5 to 256 inclusive). These objects are loaded into the interpreter's memory when the interpreter starts up. This results in a small internal cache. Because of this, the variables with the same values point to the same object and the result is True.
A similar example with a number > 256 works as expected:
>>> a = 257
>>> b = 257
>>> a == b
True
>>> a is b
False
#python
I first wanted to call the article Data Engineering terms, but then I thought that I was describing more than just terminology. We can think of those terms as architectural requirements that greatly affect the design, cost, and ultimate usefulness of the systems. Behind the marketing and noise, there are real problems that need to be tackled.
https://luminousmen.com/post/architecturally-significant-requirements
Blog | iamluminousmen
Architecturally Significant Requirements
Discover the crucial Architecturally Significant Requirements (ASR) for distributed systems, including Availability, Durability, Resiliency, Reliability, and Scalability. Learn how these factors impact system design and performance.
A collection of AWS-related videos, podcasts, code repositories, whitepapers, and feature releases, all in a single, easy-to-search interface.
https://awsstash.com/
#aws
Btw, this also works for strings:
>>> a = 'the'
>>> b = 'the'
>>> a == b
True
>>> a is b
True
As strings are immutable, it makes sense for the interpreter to store a string literal only once and point all the variables to the same object. This is called string interning, and it's an internal optimization technique for working with valid identifiers. In Python 3.6, any string of length ≤ 20 gets interned, but Python 3.7 uses the AST optimizer and interns (most) strings up to 4096 characters.
>>> a = 'the thing'
>>> b = 'the thing'
>>> a == b
True
>>> a is b
False
Wtf, you may wonder. This string is just not a valid identifier, that's all. But we can make interning explicit:
>>> import sys
>>> b = sys.intern('the thing')
>>> a = sys.intern('the thing')
>>> a == b
True
>>> a is b
True
Why?
🔸 Saving memory
🔸 Fast comparisons
🔸 Fast dictionary lookups
#python
Delta Lake vs Iceberg vs Hudi vs Hive ACID
Many companies realized that while Hadoop was initially cheap to set up, it required hiring much more expensive engineers to write optimal code; otherwise the user community could easily and quickly direct inefficient workloads at the big data system, which led to problems with SLAs, scalability, and maintenance costs.
Hadoop-based approaches are in most cases still cheaper than relying mostly on Redshift/Teradata/Exadata/Netezza/Vertica, but it is obvious why Lakehouse (a new marketing term from Databricks) solutions like Hive ACID, Delta Lake, Apache Hudi, and Iceberg started to appear.
▪️Apache Spark has had great commercial success, so Databricks, the commercial company behind it, launched Delta Lake.
▪️Apache Hudi was designed by Uber engineers to meet their internal data analysis needs: a data lake that provides fast upsert, delete, compaction, and other functions that precisely address their pain points (see the upsert sketch after the summary below). The project's community keeps sharing technical details and is gradually attracting the attention of potential users. Btw, Hudi has long been integrated into EMR, so it is very attractive to use Hudi on AWS.
▪️Apache Iceberg's momentum would appear relatively mediocre from a marketing point of view, and its functionality is not that rich yet, but it is an ambitious project with high-level abstractions and a very elegant design.
So,
➕ Delta Lake has the best momentum
➕ Iceberg has the best design
➕ Hudi has awesome performance
➖ Hive is unfortunately fading away
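To make the upsert idea concrete, here is a minimal sketch using Delta Lake's Python API (delta-spark) from PySpark; the path and column names are hypothetical, and Hudi, Iceberg, and Hive ACID offer similar merge functionality through their own APIs:
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-upsert-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# New and changed rows to merge into the table
updates = spark.createDataFrame(
    [(1, "alice", "2021-01-05"), (4, "dave", "2021-01-05")],
    ["user_id", "name", "updated_at"],
)

# Assumes a Delta table already exists at this hypothetical path
target = DeltaTable.forPath(spark, "s3://my-bucket/users")

# Upsert: update matching rows, insert the rest - no manual partition rewrites
(target.alias("t")
 .merge(updates.alias("u"), "t.user_id = u.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())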
Source
#big_data
Medium
Reshape Data Lake: Delta, Iceberg, Hudi, or Hive
The super success of Spark in the ETL area also showed that many paradigms in the traditional data warehouse are indeed critical and useful
Another Stack Overflow question that could be a decent article: double precision is different in different languages.
Stages of grief in emoji seem appropriate for this message:
denial - 😊
anger - 😡
bargaining - 😣
depression - 😭
acceptance - 😔
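The effect itself is easy to reproduce. Below is a quick Python sketch (not the code from the question, which is in C) showing that the drift comes from accumulating a binary double, while how much of it you see depends on how the value is formatted when printed:
# Repeatedly adding 0.1, as in the question's loop - the value drifts because
# 0.1 has no exact binary representation
i, values = 0.0, []
while i < 3:
    values.append(i)
    i += 0.1

print(values[10])           # 0.9999999999999999 - repr shows the exact double
print(f"{values[10]:.2f}")  # 1.00 - C-style "%.2f" formatting hides the drift
print(0.1 + 0.2 == 0.3)     # False - the classic symptom of the same issue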
Stack Overflow
Double precision is different in different languages
I'm experimenting with the precision of a double value in various programming languages.
My programs
main.c
#include <stdio.h>
int main() {
for (double i = 0.0; i < 3; i = i + 0.1) {
...