LuminousmenBlog
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
Airflow 2.0

At the end of 2020, the major release of Apache Airflow 2.0 came out. Here are the most important things it brought with it:

1. New task scheduler. The Airflow Scheduler used to be a single point of failure and was not horizontally scalable. That has changed: you can now run multiple Schedulers, so Airflow should keep running data pipelines without a hiccup even if a node failure takes down one of them. Cool, right?

2. RESTful API. We can now do CRUD operations programmatically through the API! Plus an updated security model: all REST operations now require authorization.

3. The SubDAG operator was superseded by Task Groups (see the sketch after this list).

4. Smart Sensors. Although sensors are idle for most of their execution time, each one still holds a "slot" that consumes CPU and memory. Smart Sensors consolidate that polling, reducing the number of occupied slots during high-load periods.
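To give a feel for Task Groups, here is a minimal sketch of a DAG that groups two tasks the way a SubDAG used to; the DAG id and task ids are made up, and it assumes a stock Airflow 2.x installation.

# Minimal sketch: a TaskGroup instead of a SubDAG (assumes Airflow 2.x)
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("taskgroup_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    start = DummyOperator(task_id="start")

    # The group shows up as a single collapsible node in the UI,
    # without spawning a separate DAG run the way a SubDAG did
    with TaskGroup("extract") as extract:
        DummyOperator(task_id="pull_users")
        DummyOperator(task_id="pull_orders")

    end = DummyOperator(task_id="end")
    start >> extract >> end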

Source

#big_data
How the LinkedIn engineering team moved some of their pipelines from a Lambda architecture to a Kappa architecture. Check it out

They faced the same problems I described here; sadly, they didn't cover all of them in the post.
Just a terrific lightning talk about interesting features of two not-so-famous languages.

I screamed with laughter 😆
Another area of knowledge that sits on a slightly different plane but is directly related to data. Management challenges cover privacy, security, governance, and data/metadata management.

https://luminousmen.com/post/management-challenges-in-big-data
Snowflake

In my opinion, traditional storage methods and technologies struggle to deliver the service, simplicity, and value that a rapidly changing business needs. Besides removing those limitations, Snowflake provides significant performance benefits, an easy and intuitive way for administrators and users to interact with the system, and a way to scale to levels of concurrency that would cost a huge pile of money in an MPP architecture and are simply impossible with traditional approaches.

Snowflake, or SnowflakeDB, is a cloud-based SaaS database for analytical workloads and batch data processing, typically used to build a Data Warehouse in the cloud. It supports transactions, different isolation levels, ACID guarantees, read consistency, and multi-version concurrency control. Its main distinguishing features are storage decoupled from compute and a cloud-first focus.
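To illustrate the decoupled-compute part, here is a minimal sketch using the snowflake-connector-python package; the account, credentials, and warehouse name are made up. Compute lives in "virtual warehouses" that you can create, resize, or suspend without touching the data in storage.

import snowflake.connector

# Connect to a (hypothetical) Snowflake account
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="***",
)
cur = conn.cursor()

# A virtual warehouse is a compute cluster managed separately from storage
cur.execute("CREATE WAREHOUSE IF NOT EXISTS analytics_wh WITH WAREHOUSE_SIZE = 'XSMALL'")
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'")  # scale up for a heavy batch
cur.execute("ALTER WAREHOUSE analytics_wh SUSPEND")  # stop paying for compute when idle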

There's been a lot of news about Snowflake lately. Everyone is covering how much of a silver bullet it can be. While I like the technology side of things, it also has its downsides. Going through the article, here is a summary for you:

▪️it's a proprietary SaaS solution. You can't solve problems any other way than through the company's support team
▪️it's a cloud-native solution; you can't deploy it on-premises, so you don't really "own" the data
▪️it's an opinionated solution. You can't optimize it much, as there aren't many knobs you can turn
▪️it's expensive. Depending on the use case, Snowflake may cost more than competitors like Amazon Redshift and traditional MPP solutions like Vertica

#big_data
When you're choosing a base image for your Docker image, Alpine Linux is often recommended. Using Alpine, you're told, will make your images smaller and speed up your builds. But in practice, Alpine-based Python builds are vastly slower and the images often end up bigger.

I've faced the same issues with a custom Airflow/Python container as the author describes here, and as a solution I switched to the slim Docker images. Doing fine so far :)

#python
JetBrains published the results of its Python Developers Survey for 2020. After questioning over 28,000 Python developers and fans from nearly 200 countries/regions in October 2020, here are some of the notable insights:

⚡️ JavaScript is the most popular language for developers to use with Python, particularly for web developers. As the survey notes, with HTML/CSS, Bash/Shell, and SQL, "they create a stack of languages where 2 out of every 5 Python devs are using at least one of them."
⚡️ JavaScript and C/C++ are the most common main languages for those who use Python as a secondary language.
⚡️ 55% surveyed use Python for data analysis, making it the top use case, with 50% using it for web development.
⚡️ 94% use Python 3, with 6% using Python 2. 44% are using Python 3.8.
⚡️ Flask is the most popular web framework (46%), followed by Django at 43% and FastAPI at 12%.
⚡️ Most Pythonistas who use Flask prefer SQLAlchemy, while Django users use Django ORM.
⚡️ PostgreSQL is the most popular database amongst Python developers (45%).
⚡️ AWS is the top cloud platform (53%), followed by Google Cloud (33%).
⚡️ Linux is the most popular operating system (68%), followed by Windows (48%).

Check out the full report for more insights.

#python
I want to share this seminar series which I've been following for a while now.

So Stanford created an MLSys seminar series that looks at the frontier of machine learning systems and how machine learning changes the modern programming stack. They have a lot of rockstar speakers and interesting conversations! Recommended 👌

https://mlsys.stanford.edu/

#ml
Pattern matching

For a long time, the Python community has put forward various proposals for expressing multi-branch conditions in the language (similar to the switch statement in C/C++), but none of them ever made it in. Over the past year or so, the community discussed a proposal that could solve the multi-branch problem (and more) and adopted structural pattern matching through PEP 634, 635, and 636. It will be a feature of the upcoming Python 3.10 release.

Here is an example:

match command.split():
    case ["quit"]:
        print("Goodbye!")
        quit_game()
    case ["get", obj]:
        character.get(obj, current_room)
    case ["go", direction]:
        current_room = current_room.neighbor(direction)
    case _:
        print(f"Sorry, I couldn't understand {command!r}")

To me it looks like Scala-style syntax, which is not bad.

#python
This guy shared a Data Science cheatsheet to assist with exam reviews, interview prep, etc. Worth sharing.

#ds
​​Hey everybody,

Channel is growing so let me collect all the interesting posts from this channel that I think should get more attention:

#dev
⚡️Technological degradation
⚡️Define "production-ready"
⚡️Abstraction is not OOP

#python
⚡️Use pathlib
⚡️Pip constraints files
⚡️.pth files

#soft_skills
⚡️Your top skill
⚡️Ask stupid questions
⚡️Let robots yell at people
⚡️Soft skills thoughts

#big_data
⚡️S3 vs HDFS
⚡️Famous in-memory data format
⚡️Complexity in distributed systems
⚡️Snowflake

#ml
⚡️ML system basic framework
⚡️AutoML
⚡️MLOps
⚡️Testing and validation in ML

Bigger posts I'm sharing on my small website ✌️
Perhaps not every Python developer knows this interesting property of CPython that makes newcomers go crazy:

>>> a = 255
>>> b = 255
>>> a == b
True
>>> a is b
True

The double equals operator checks that the objects' values are equal, while the is operator checks that the variables refer to the same object. Naively, a and b should be two different objects, so you would expect a is b to return False.

There is actually an optimization in CPython for small integers (-5 to 256 inclusive): these objects are created when the interpreter starts up and kept in a small internal cache. Because of this, variables with the same small value point to the same object, and the result is True.

A similar example with numbers > 256 works as expected:

>>> a = 257
>>> b = 257
>>> a == b
True
>>> a is b
False

#python
I first wanted to call the article Data Engineering terms, but then I thought that I was describing more than just terminology. We can think of those terms as architectural requirements that greatly affect the design, cost, and ultimate usefulness of the systems. Behind the marketing and noise, there are real problems that need to be tackled.

https://luminousmen.com/post/architecturally-significant-requirements
A collection of AWS-related videos, podcasts, code repositories, whitepapers, and feature releases, all in a single, easy-to-search interface.

https://awsstash.com/

#aws
Btw, this also works for strings:

>>> a = 'the'
>>> b = 'the'
>>> a == b
True
>>> a is b
True

Since strings are immutable, it makes sense for the interpreter to store a string literal only once and point all the variables to the same object. This is called string interning, and it's an internal optimization technique that CPython applies to strings that look like valid identifiers.

In Python 3.6, any string of length ≤ 20 would get interned. But Python 3.7 uses the AST optimizer, and (most) strings of up to 4096 characters are interned.

>>> a = 'the thing'
>>> b = 'the thing'
>>> a == b
True
>>> a is b
False

WTF, you may wonder. This string simply isn't a valid identifier (it contains a space), that's all. But we can make interning explicit:

>>> import sys
>>> b = sys.intern('the thing')
>>> a = sys.intern('the thing')
>>> a == b
True
>>> a is b
True

Why?

🔸 Saving memory
🔸 Fast comparisons
🔸 Fast dictionary lookups

#python
Delta Lake vs Iceberg vs Hudi vs Hive ACID

Many companies realized that while Hadoop was initially cheap to set up, it required hiring much more expensive engineers to write optimal code; otherwise the user community could easily and quickly throw an inefficient workload at the big data system, which led to problems with SLAs, scalability, and maintenance costs.

Hadoop-based approaches are in most cases still cheaper than relying mostly on Redshift/Teradata/Exadata/Netezza/Vertica, but it is obvious why Lakehouse (a new marketing term from Databricks) solutions like Hive ACID, Delta Lake, Apache Hudi, and Iceberg started to appear.

▪️Apache Spark has had great commercial success, so Databricks, the company behind it, launched Delta Lake (see the sketch after this list).

▪️Apache Hudi was designed by Uber engineers as a Data Lake layer for their internal data analysis needs; it provides fast upserts, deletes, compaction, and other functions that precisely address their pain points. On top of that, the project's community members share technical details and gradually attract potential users. Btw, Hudi has been integrated into EMR for a long time, so it is very attractive to use Hudi on AWS.

▪️Apache Iceberg's momentum appears relatively mediocre from a marketing point of view and its functionality is not that rich yet, but it is an ambitious project with high-level abstractions and a very elegant design.
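To make the Delta Lake idea concrete, here is a minimal PySpark sketch; it assumes Spark with the delta-spark package available, and the path and schema are made up. A Delta table is just Parquet files plus a transaction log, which is what gives you ACID and time travel on top of a data lake.

from pyspark.sql import SparkSession

# Enable the Delta Lake extensions on a plain Spark session
spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Writing produces Parquet files plus a _delta_log transaction log
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")

# Read the current version, or time-travel to an older one
spark.read.format("delta").load("/tmp/users_delta").show()
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users_delta").show()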

So,

Delta Lake has the best momentum

Iceberg has the best design

Hudi has awesome performance

Hive is unfortunately fading away

Source

#big_data