LuminousmenBlog
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
DataFrames can be partially cached, but individual partitions cannot. When you call cache() or persist(), the DataFrame is not actually cached until you invoke an action that scans every record, e.g. count().

So if you use an action like take(1), only one partition will be cached, because Catalyst understands that it doesn't need to compute all the partitions just to return one record.
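
A minimal PySpark sketch of this behavior; the dataset and partition count are arbitrary, just for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(0, 10_000_000, numPartitions=8)

df.cache()   # only marks the DataFrame for caching, nothing is materialized yet

df.take(1)   # computes (and caches) only the partition needed for one record

df.count()   # scans every record, so now all 8 partitions end up cached

# Check the Storage tab in the Spark UI after each action to watch
# the "Fraction Cached" change.
```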

#spark
Let robots yell at people

As a technical lead, you are at your best as a productivity multiplier for your entire team. Anything you can do to increase your team's velocity has compounding effects for the company. Think automation/instrumentation/documentation. (rephrasing some twitter guy, not my words)

For example, I often debug code with plain print statements in Python. Yes, I rarely use pdb and all that stuff, and I don't suffer much from it. But sometimes I forget to remove those debugging prints before the PR. To keep colleagues from constantly coming to me after a review to point it out, we can create a linter rule for that. It helps not to sour a relationship with anybody, and it is more effective at preventing repeated mistakes and at training newcomers.
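
A minimal sketch of such a check built on Python's ast module (a ready-made alternative is the flake8-print plugin); the script and its message are my own, not an existing tool:

```python
import ast
import sys


def find_prints(path: str) -> list[int]:
    """Return the line numbers of print() calls in a Python file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]


if __name__ == "__main__":
    exit_code = 0
    for path in sys.argv[1:]:
        for lineno in find_prints(path):
            print(f"{path}:{lineno}: leftover print() call, remove before merging")
            exit_code = 1
    sys.exit(exit_code)
```

Hook it up as a pre-commit hook or a CI step, and the review comment writes itself.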

What do you think?

#dev #management
I expect that for you, Airflow is not just a word for passing gas. Airflow DAGs often become too big and complicated to understand. They get split between different teams within a company for implementation and support, and you may end up with the problem of incorporating several DAGs into one pipeline.
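
As an illustration of the kind of wiring this involves, here is a minimal sketch of one common approach, an ExternalTaskSensor that blocks a downstream DAG until a task in an upstream DAG succeeds (import paths assume Airflow 2.x; the dag/task ids are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # wait until `final_task` of `upstream_dag` finishes for the same execution date
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        external_task_id="final_task",
        mode="reschedule",  # release the worker slot while waiting
    )

    process = BashOperator(
        task_id="process",
        bash_command="echo 'upstream finished, processing...'",
    )

    wait_for_upstream >> process
```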

https://luminousmen.com/post/airflow-dag-dependencies

#big_data
New York University has made available a course on deep learning (https://cds.nyu.edu/deep-learning/) from Yann LeCun.

He is one of the creators of modern deep learning, a Turing Award winner, and the former head of AI research at Facebook.
You don't need Hadoop

In the article, the author argues that there is no need to use Hadoop for 600MB of data. The author puts a clear emphasis on the size of the data, and his point is understandable, but in my opinion the size of the data matters less than its structure and the query patterns. Yes, Big Data is when the data doesn't fit on one machine. But not every task that requires horizontal scaling is a big data task.

Conversely, if the processing is trivial, even 100 terabytes may be a small job for Hadoop. It depends on what you do and how you do it.

Everything boils down to a trivial conclusion: use the tool that suits the task.

#big_data
Apple released a TensorFlow fork with support for the new M1 chip on macOS Big Sur. So TensorFlow now has a version for Apple's own processor, with hardware acceleration. Users can get up to 7x faster training on the new 13-inch MacBook Pro with M1.

I like the idea of moving ML closer to the user; that makes a lot of sense, especially for Apple.

Link

#ml
Evolution of trust

COVID is still brutal on the streets, but there is an epidemic that has been raging even longer. Using game theory, the author tries to explain the epidemic of mistrust and looks for a way to fix it.

https://ncase.me/trust/

#soft_skills
If you have been writing Spark applications for a while, you have inevitably run into tuning its configuration parameters. Now you can do it automatically with a tool that optimizes the cluster resources for you. Made by me for you.

http://spark-configuration.luminousmen.com/
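
The output of such a calculator is essentially a set of Spark properties you feed to your job; here is a sketch of applying them through SparkSession (the values are placeholders, not recommendations):

```python
from pyspark.sql import SparkSession

# Placeholder values: plug in whatever the calculator suggests for your cluster.
spark = (
    SparkSession.builder
    .appName("tuned-app")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "14g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```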

#spark #big_data
Who here is professionally connected to or interested in DS/ML?
Anonymous Poll
27%
I'm an ML engineer
35%
I'm a Data Scientist
38%
I've never done any, but I'm interested
A preview version of Learning Spark, updated for the latest 3.0 release. I'm saving you the time of filling out the forms; enjoy the book.

#big_data #spark
When I got my first computer, it had Windows installed on it. I don't know about you, but every time Windows threw an error that meant nothing to me, I went to the lower Internet to solve it. And I found out that there are many more interesting things on the lower Internet...

Windows has created a whole layer of programmers who can solve problems. Thanks to it for that.

But Windows OS is still a pile of horse shit.

That's why you can buy a New Year's ugly sweater with it: https://gear.xbox.com/pages/windows
Andreessen Horowitz published a detailed guide on the state of architectures for data infrastructure.

It covers data sources, ingestion and transformation, storage, historical analytics, predictive modeling, and outputs, along with the different tools that can be used for each, as well as case studies from several large companies on their data infrastructure setups.

Highly recommended.

https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/